We’re examining how Linux coordinates system suspend: from the moment user space asks for sleep to the point the machine either powers down or aborts. The focal point is kernel/power/main.c in the Linux kernel, the core of the /sys/power interface. It turns simple text writes into orchestrated suspend and hibernate transitions, coordinates filesystems and workqueues, and records failures. I’m Mahmoud Zalt, an AI solutions architect, and we’ll use this file to follow one idea: good power management is really about disciplined coordination—across user space, kernel subsystems, and slow hardware like disks.
We’ll first look at how /sys/power is structured as a control panel, then trace the race-free path to sleep. From there we’ll zoom into filesystem sync and see how it can veto suspend, examine the suspend “black box” stats recorder, and end with concrete design patterns you can reuse in your own systems.
- The kernel’s power control panel
- Designing a race-free path to sleep
- When filesystem sync decides you don’t sleep
- A black box recorder for suspend failures
- Patterns you can reuse outside the kernel
The kernel’s power control panel
kernel/power/main.c is effectively the kernel’s power management control panel. It owns the /sys/power interface, the power-management notifier chain, PM workqueues, and a compact statistics recorder. User space talks to it using simple text files; the kernel responds by orchestrating complex transitions.
Project: linux
  kernel/
    power/
      main.c        <-- /sys/power core control & stats
      power.h       (globals like system_transition_mutex, pm_states)
      suspend.c     (pm_suspend(), pm_suspend_in_progress())
      hibernate.c   (hibernate(), hibernation_in_progress())
      autosleep.c   (pm_autosleep_* APIs)
  drivers/
    base/
      power/
        wakeup.c    (pm_get_wakeup_count(), pm_save_wakeup_count())

User space
    |
    +--> /sys/power/state, mem_sleep, autosleep, wakeup_count, ...
    |
    v
kernel/power/main.c
    |
    +--> PM notifiers (drivers, subsystems)
    +--> Suspend/hibernate engines
    +--> Filesystem sync via pm_fs_sync_wq
    +--> Stats & debugfs (suspend_stats)
How main.c sits between user space and the rest of the PM stack.
At a high level, this file:
- Exposes sysfs “switches” like /sys/power/state, mem_sleep, wakeup_count, autosleep, sync_on_suspend, freeze_filesystems, and several debug toggles.
- Provides coordination APIs to other kernel code, such as lock_system_sleep(), GFP mask helpers, and a global power-management notifier chain.
- Synchronizes filesystems asynchronously before suspend.
- Records suspend/hibernate statistics in a compact “black box” structure.
This would look like a configuration module if you only saw the sysfs handlers. It becomes interesting when you see how those handlers cooperate to avoid races and data loss.
Think of /sys/power as a physical control panel with labeled buttons and LEDs. This file defines what each button means, which internal relays it flips, and how to ensure two buttons aren’t pressed in a dangerously conflicting way.
Designing a race-free path to sleep
With the control panel in place, the central question is: how do we press the “sleep” button safely while the world keeps generating wakeups? The main challenge in system sleep is races with wakeup events. A wakeup could arrive just as user space decides to suspend, and we must not lose it.
The state attribute: from string to transition
The human-facing entry point is /sys/power/state. It lists and accepts strings like freeze, mem, and disk. state_show() prints the supported strings, and decode_state() translates a written string into a small set of suspend_state_t values:
static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
			  char *buf)
{
	ssize_t count = 0;
#ifdef CONFIG_SUSPEND
	suspend_state_t i;

	for (i = PM_SUSPEND_MIN; i < PM_SUSPEND_MAX; i++)
		if (pm_states[i])
			count += sysfs_emit_at(buf, count, "%s ", pm_states[i]);
#endif
	if (hibernation_available())
		count += sysfs_emit_at(buf, count, "disk ");

	if (count > 0)
		buf[count - 1] = '\n';

	return count;
}

static suspend_state_t decode_state(const char *buf, size_t n)
{
#ifdef CONFIG_SUSPEND
	suspend_state_t state;
#endif
	char *p;
	int len;

	p = memchr(buf, '\n', n);
	len = p ? p - buf : n;

	if (len == 4 && str_has_prefix(buf, "disk"))
		return PM_SUSPEND_MAX;

#ifdef CONFIG_SUSPEND
	for (state = PM_SUSPEND_MIN; state < PM_SUSPEND_MAX; state++) {
		const char *label = pm_states[state];

		if (label && len == strlen(label) && !strncmp(buf, label, len))
			return state;
	}
#endif
	return PM_SUSPEND_ON;
}
This illustrates a valuable pattern: translate text into a small, closed set of values as early as possible. Unknown inputs map to a safe default (PM_SUSPEND_ON, “don’t sleep”), and hibernation is represented by a special sentinel (PM_SUSPEND_MAX).
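To make the pattern concrete outside the kernel, here is a minimal user-space sketch. The enum sleep_mode, the mode_labels table, and parse_mode() are hypothetical names invented for this illustration; they are not kernel symbols, and the sketch only mirrors the shape of decode_state().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modes for illustration; not the kernel's suspend_state_t. */
enum sleep_mode {
	MODE_ON = 0,	/* safe default: stay awake */
	MODE_FREEZE,
	MODE_MEM,
	MODE_DISK,
	MODE_COUNT
};

/* Label table indexed by mode; MODE_ON deliberately has no label. */
static const char *const mode_labels[MODE_COUNT] = {
	[MODE_FREEZE] = "freeze",
	[MODE_MEM]    = "mem",
	[MODE_DISK]   = "disk",
};

/*
 * Map a buffer of n bytes (optionally newline-terminated) to a mode.
 * Anything that is not an exact label match maps to MODE_ON.
 */
static enum sleep_mode parse_mode(const char *buf, size_t n)
{
	const char *nl;
	size_t len;
	int m;

	assert(buf != NULL);

	/* Ignore everything from the first newline onward. */
	nl = memchr(buf, '\n', n);
	len = nl ? (size_t)(nl - buf) : n;

	for (m = MODE_FREEZE; m < MODE_COUNT; m++) {
		const char *label = mode_labels[m];

		if (label && len == strlen(label) && !strncmp(buf, label, len))
			return (enum sleep_mode)m;
	}
	return MODE_ON;
}
```

With this shape, parse_mode("mem\n", 4) yields MODE_MEM, while a near miss such as "memx" falls back to MODE_ON instead of being partially matched. The length check before strncmp() is what rejects prefixes and over-long input.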
The real work happens in state_store():
static ssize_t state_store(struct kobject *kobj, struct kobj_attribute *attr,
			   const char *buf, size_t n)
{
	suspend_state_t state;
	int error;

	error = pm_autosleep_lock();
	if (error)
		return error;

	if (pm_autosleep_state() > PM_SUSPEND_ON) {
		error = -EBUSY;
		goto out;
	}

	state = decode_state(buf, n);
	if (state < PM_SUSPEND_MAX) {
		if (state == PM_SUSPEND_MEM)
			state = mem_sleep_current;

		error = pm_suspend(state);
	} else if (state == PM_SUSPEND_MAX) {
		error = hibernate();
	} else {
		error = -EINVAL;
	}

 out:
	pm_autosleep_unlock();
	return error ? error : n;
}
Two coordination decisions dominate here:
- Autosleep lock: pm_autosleep_lock() guarantees that a manual suspend via state doesn’t race with ongoing autosleep activity. If autosleep is already active, we return -EBUSY.
- Platform mapping: The generic mem state is translated to mem_sleep_current, which hides platform-specific choices like s2idle vs deep sleep.
The handler doesn’t embed policy. It defers to small helpers (decode_state, pm_autosleep_lock, pm_suspend, hibernate). The top-level flow stays readable: parse → validate → call.
The wakeup ticket system: wakeup_count
Parsing state strings isn’t enough to be safe. One question remains: what if a wakeup arrives while user space is preparing to sleep? For that, Linux uses the wakeup_count protocol, exported as another sysfs attribute.
A useful mental model is a numbered ticket:
- User space reads the current ticket from /sys/power/wakeup_count.
- It does its preparations.
- It writes the same ticket back to wakeup_count.
- If a wakeup arrived in the meantime, the kernel refuses the write; suspend should not proceed.
static ssize_t wakeup_count_show(struct kobject *kobj,
				 struct kobj_attribute *attr,
				 char *buf)
{
	unsigned int val;

	return pm_get_wakeup_count(&val, true) ?
		sysfs_emit(buf, "%u\n", val) : -EINTR;
}

static ssize_t wakeup_count_store(struct kobject *kobj,
				  struct kobj_attribute *attr,
				  const char *buf, size_t n)
{
	unsigned int val;
	int error;

	error = pm_autosleep_lock();
	if (error)
		return error;

	if (pm_autosleep_state() > PM_SUSPEND_ON) {
		error = -EBUSY;
		goto out;
	}

	error = -EINVAL;
	if (sscanf(buf, "%u", &val) == 1) {
		if (pm_save_wakeup_count(val))
			error = n;
		else
			pm_print_active_wakeup_sources();
	}

 out:
	pm_autosleep_unlock();
	return error;
}
User space and kernel agree on a very narrow contract: a single monotonic counter and a return code. That’s enough to avoid a class of subtle suspend vs. wakeup races, as long as user space follows the documented protocol.
This is a clear example of solving a hard concurrency problem with a tiny, explicit protocol instead of timing heuristics.
When filesystem sync decides you don’t sleep
Even with a race-free sleep handshake, there’s another high-stakes decision: do we trust the filesystem state right now? Powering down with lots of dirty data risks slower resume, inconsistent state, or worse if something crashes. That’s where pm_sleep_fs_sync() comes in—and where a filesystem sync can veto your sleep.
Asynchronous sync with a wakeup-aware escape hatch
Instead of blocking the caller in ksys_sync(), the PM core offloads the heavy work to a dedicated workqueue and coordinates using an atomic counter and a wait queue:
static bool pm_fs_sync_completed(void)
{
	return atomic_read(&pm_fs_sync_count) == 0;
}

static void pm_fs_sync_work_fn(struct work_struct *work)
{
	ksys_sync_helper();

	if (atomic_dec_and_test(&pm_fs_sync_count))
		wake_up(&pm_fs_sync_wait);
}
static DECLARE_WORK(pm_fs_sync_work, pm_fs_sync_work_fn);

int pm_sleep_fs_sync(void)
{
	pm_wakeup_clear(0);

	if (!work_pending(&pm_fs_sync_work)) {
		atomic_inc(&pm_fs_sync_count);
		queue_work(pm_fs_sync_wq, &pm_fs_sync_work);
	}

	while (!pm_fs_sync_completed()) {
		if (pm_wakeup_pending())
			return -EBUSY;

		wait_event_timeout(pm_fs_sync_wait, pm_fs_sync_completed(),
				   PM_FS_SYNC_WAKEUP_RESOLUTION);
	}

	return 0;
}
Several coordination decisions are packed into this small function:
- Decoupled work: The heavyweight ksys_sync_helper() call lives in pm_fs_sync_work_fn(), running on pm_fs_sync_wq. The caller of pm_sleep_fs_sync() only cares whether sync finished or was aborted.
- Back-to-back suspend handling: Before queueing work, it checks work_pending(). If a sync work item is already queued but has not started running, it reuses that item rather than enqueueing another one. Because pm_fs_sync_wq is an ordered workqueue, syncs never run in parallel.
- Wakeup-aware waiting: The loop polls pm_wakeup_pending() before each timed wait. If a wakeup appears, the function exits with -EBUSY, signaling higher-level suspend logic to abort or retry.
This pattern (start heavy work on a workqueue, then wait in small timed steps while checking a cancellation condition) is a reusable recipe for any operation that must abort quickly when the world changes.
This is where the title becomes literal: as long as the filesystem sync is in progress, suspend is effectively on hold. If a wakeup happens first, pm_sleep_fs_sync() relinquishes control and refuses to declare success. The decision to sleep or not is coordinated across storage safety and event activity, not just a naive “call sync then sleep”.
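The control flow of that waiting loop can be sketched without any kernel machinery. The version below is a single-threaded simulation under stated assumptions: struct job, wait_one_slice(), wait_abortable(), and cancel_at_two() are hypothetical names, and the "worker" is simulated by decrementing a counter once per wait slice rather than by a real thread or workqueue.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job state; in a real system a worker would update it. */
struct job {
	int remaining;		/* units of work left; 0 means finished */
	bool cancel;		/* set when the waiter should give up */
};

/* Stand-in for one bounded wait: the simulated worker finishes one unit. */
static void wait_one_slice(struct job *j)
{
	if (j->remaining > 0)
		j->remaining--;
}

/*
 * Wait for the job in bounded slices, checking for cancellation before each
 * slice. Returns 0 on completion or -EBUSY if cancellation was seen first.
 * on_slice is an optional hook called after every slice (used for testing).
 */
static int wait_abortable(struct job *j, void (*on_slice)(struct job *))
{
	assert(j != NULL);

	while (j->remaining > 0) {
		if (j->cancel)
			return -EBUSY;
		wait_one_slice(j);
		if (on_slice)
			on_slice(j);
	}
	return 0;
}

/* Example hook: request cancellation once two units remain. */
static void cancel_at_two(struct job *j)
{
	if (j->remaining == 2)
		j->cancel = true;
}
```

As in pm_sleep_fs_sync(), completion is checked before cancellation, so a job that has already finished reports success even if a cancel request is also set. The slice length bounds how long a cancellation can go unnoticed, which is the trade-off a constant like PM_FS_SYNC_WAKEUP_RESOLUTION expresses.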
Boot-time wiring: workqueues before knobs
This syncing machinery depends on PM-specific workqueues created at boot:
struct workqueue_struct *pm_wq;
EXPORT_SYMBOL_GPL(pm_wq);

static int __init pm_start_workqueues(void)
{
	pm_wq = alloc_workqueue("pm", WQ_FREEZABLE | WQ_UNBOUND, 0);
	if (!pm_wq)
		return -ENOMEM;

#if defined(CONFIG_SUSPEND) || defined(CONFIG_HIBERNATION)
	pm_fs_sync_wq = alloc_ordered_workqueue("pm_fs_sync", 0);
	if (!pm_fs_sync_wq) {
		destroy_workqueue(pm_wq);
		return -ENOMEM;
	}
#endif

	return 0;
}

static int __init pm_init(void)
{
	int error = pm_start_workqueues();

	if (error)
		return error;

	hibernate_image_size_init();
	hibernate_reserved_size_init();
	pm_states_init();

	power_kobj = kobject_create_and_add("power", NULL);
	if (!power_kobj)
		return -ENOMEM;

	error = sysfs_create_groups(power_kobj, attr_groups);
	if (error)
		return error;

	pm_print_times_init();
	return pm_autosleep_init();
}

core_initcall(pm_init);
Initialization itself is structured as coordination:
- First, start the workqueues suspend depends on.
- Then initialize global PM and hibernation state.
- Only then create the power kobject and attach attribute groups, so user space sees a coherent, working control surface.
A black box recorder for suspend failures
Even with careful coordination, suspend flows do fail—because of drivers, firmware, or configuration. To debug those failures, main.c includes a compact statistics recorder: suspend_stats. Conceptually, it’s a flight recorder for sleep attempts.
#define SUSPEND_NR_STEPS	SUSPEND_RESUME
#define REC_FAILED_NUM		2

struct suspend_stats {
	unsigned int step_failures[SUSPEND_NR_STEPS];
	unsigned int success;
	unsigned int fail;
	int last_failed_dev;
	char failed_devs[REC_FAILED_NUM][40];
	int last_failed_errno;
	int errno[REC_FAILED_NUM];
	int last_failed_step;
	u64 last_hw_sleep;
	u64 total_hw_sleep;
	u64 max_hw_sleep;
	enum suspend_stat_step failed_steps[REC_FAILED_NUM];
};

static struct suspend_stats suspend_stats;
static DEFINE_MUTEX(suspend_stats_lock);

void dpm_save_failed_dev(const char *name)
{
	mutex_lock(&suspend_stats_lock);

	strscpy(suspend_stats.failed_devs[suspend_stats.last_failed_dev],
		name, sizeof(suspend_stats.failed_devs[0]));
	suspend_stats.last_failed_dev++;
	suspend_stats.last_failed_dev %= REC_FAILED_NUM;

	mutex_unlock(&suspend_stats_lock);
}

void dpm_save_failed_step(enum suspend_stat_step step)
{
	suspend_stats.step_failures[step - 1]++;
	suspend_stats.failed_steps[suspend_stats.last_failed_step] = step;
	suspend_stats.last_failed_step++;
	suspend_stats.last_failed_step %= REC_FAILED_NUM;
}

void dpm_save_errno(int err)
{
	if (!err) {
		suspend_stats.success++;
		return;
	}

	suspend_stats.fail++;

	suspend_stats.errno[suspend_stats.last_failed_errno] = err;
	suspend_stats.last_failed_errno++;
	suspend_stats.last_failed_errno %= REC_FAILED_NUM;
}
This structure encodes several deliberate tradeoffs:
- Tiny ring buffers: For failed devices, errno values, and steps, it uses fixed-size ring buffers (REC_FAILED_NUM = 2) indexed modulo N. The goal isn’t full history, just “what failed recently?”
- Selective locking: Only dpm_save_failed_dev() takes suspend_stats_lock. The other writers update their counters without taking it. For diagnostics, a small chance that related fields are briefly out of step with each other is acceptable if it keeps the recorder cheap.
- Structured failure context: step_failures, failed_steps, failed_devs, and errno combine to answer “which phase failed, on which device, and with which error?”
This is a case of fit-for-purpose consistency. For billing, you’d want precise, strongly consistent updates. For debug stats, “approximately correct and always cheap” wins.
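A stripped-down recorder in the same spirit fits in a few dozen lines of portable C. This is a sketch under assumptions, not the kernel's layout: struct fail_rec, rec_save(), and rec_last_err() are hypothetical names, and unlike suspend_stats this version keeps one shared index for names and error codes instead of a separate index per array.

```c
#include <assert.h>
#include <string.h>

#define REC_SLOTS 2	/* hypothetical; plays the role of REC_FAILED_NUM */
#define NAME_LEN  40

struct fail_rec {
	unsigned int success;
	unsigned int fail;
	unsigned int next;		/* slot the next failure overwrites */
	int errs[REC_SLOTS];
	char names[REC_SLOTS][NAME_LEN];
};

/* Record one attempt: err == 0 counts as success, otherwise store details. */
static void rec_save(struct fail_rec *r, const char *name, int err)
{
	assert(r != NULL && name != NULL);

	if (!err) {
		r->success++;
		return;
	}
	r->fail++;
	r->errs[r->next] = err;
	/* Bounded copy that always leaves the slot NUL-terminated. */
	strncpy(r->names[r->next], name, NAME_LEN - 1);
	r->names[r->next][NAME_LEN - 1] = '\0';
	/* Advance modulo the slot count so old entries get overwritten. */
	r->next = (r->next + 1) % REC_SLOTS;
}

/* Most recent failure's error code, or 0 if none has been recorded. */
static int rec_last_err(const struct fail_rec *r)
{
	assert(r != NULL);
	if (!r->fail)
		return 0;
	return r->errs[(r->next + REC_SLOTS - 1) % REC_SLOTS];
}
```

Memory use is fixed no matter how many failures occur, and the counters still report totals even after individual entries have been overwritten. If several threads can record failures, the caller would need to add its own locking or accept the same loose consistency described above.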
These statistics are then surfaced in two styles:
- Sysfs under /sys/power/suspend_stats/..., with hardware sleep timing fields gated on ACPI low-power S0 support.
- Debugfs as /sys/kernel/debug/suspend_stats, a multi-line human-readable summary.
The separation between machine-friendly (one value per file) and human-friendly (rich text) views is another coordination decision: observability for tooling vs. usability for humans.
Patterns you can reuse outside the kernel
Although kernel/power/main.c is deep in kernel space, the patterns it uses are broadly applicable. The common thread is disciplined coordination: treating mode transitions as protocols rather than ad-hoc sequences. Four patterns stand out:
- Model commands as enums, not raw strings. decode_state() and related helpers turn free-form text into a closed set of internal states, with safe defaults for unknown input. In your APIs, treat user-specified modes the same way: parse to an enum early, then switch on that.
- Use explicit handshakes to avoid races. The wakeup_count protocol is effectively a compare-and-swap between user space and kernel: “sleep only if the counter is still X.” Any multi-actor workflow (deployments, job scheduling, leases) can benefit from a similar ticket or version counter instead of relying on timing assumptions.
- Offload heavy work, but keep a fast abort path. pm_sleep_fs_sync() queues heavy I/O to a workqueue and then waits in small intervals while checking for wakeups. Long-running tasks in your services (rebuilds, compactions, background jobs) can follow this template so that configuration changes, leadership changes, or cancellations take effect promptly.
- Record just enough structured history to debug. suspend_stats doesn’t log everything; it keeps a tiny, structured “last N failures” ring plus counters. For many systems, a small, well-designed error recorder is more actionable (and safer) than unbounded logging.
Along the way we saw how filesystem sync, wakeup handshakes, workqueues, and statistics all come together to decide whether the system actually sleeps. The primary lesson is that robust power management is less about individual syscalls and more about coordinating stateful components through clear protocols and carefully ordered steps.
When you design systems that switch modes under load—rolling deploys, blue/green cutovers, maintenance drains—you can approach them the same way kernel/power/main.c approaches suspend: define narrow contracts, make races impossible by protocol, offload heavy work but stay abortable, and record just enough to understand failures later.