We’re going to dissect how Linux decides which task gets the next slice of CPU time. The code lives in kernel/sched/core.c, the file that coordinates all the per-class schedulers (CFS, RT, deadline, idle, stop, and BPF-based extensions). I’m Mahmoud Zalt, an AI solutions architect, and we’ll use this file as a case study in how to design a complex, high-performance scheduler without losing control of correctness.
Our focus is one core idea: separate the lifecycle of work (blocking and waking) from the policy that selects what runs next, then glue them together with explicit state and clear invariants. Everything that follows—schedule, try_to_wake_up, core scheduling, and tick handling—is an application of that idea under extreme concurrency.
- The core loop: `__schedule`
- Waking tasks safely: `try_to_wake_up`
- Sharing cores securely: core scheduling
- Keeping the scheduler honest: ticks and metrics
- Design lessons for your own systems
Scheduler as orchestrator
To build a mental model, treat each CPU as a runway and each runnable task as a plane waiting to take off. The scheduler’s job is to:
- Keep each runway busy without collisions (one running task per CPU).
- Honor different flight classes: real-time, deadline, fair (CFS), background, and special “stop” tasks.
- Handle rerouting (migration across CPUs) when constraints or topology change.
- Enforce airspace constraints: cgroups, utilization clamping, quotas, NUMA, and security policies.
```
Project (linux)
└── kernel/
└── sched/
├── core.c <- this file: main scheduler control flow
├── fair.c (CFS scheduler class)
├── rt.c (real-time scheduler class)
├── deadline.c (deadline scheduler class)
├── idle.c (idle scheduler class)
├── stop_task.c (stop scheduler class)
├── sched.h (common scheduler declarations)
├── pelt.c (load tracking)
├── autogroup.c (autogrouping)
└── stats.c (sched stats helpers)
```
Call graph (simplified):
```
schedule / preempt_schedule
|
v
__schedule
|
+--> try_to_block_task (maybe)
|
+--> pick_next_task (core or non-core)
| |
| v
| sched_class->pick_next_task / pick_task
|
+--> context_switch
| |
| v
| switch_mm / switch_to / finish_task_switch
|
v
next task runs
try_to_wake_up
|
+--> p->pi_lock, state checks
+--> ttwu_runnable (fast path if on_rq)
+--> select_task_rq
+--> ttwu_queue (rq lock or wakelist)
+--> resched_curr / send IPI
```
*`core.c` sits between the per-class schedulers and the rest of the kernel, orchestrating who runs where and when.*
This file is a masterclass in coordinating many moving parts around a single responsibility: choosing the next task on each CPU, safely, under extreme concurrency.
Rule of thumb: When a subsystem touches timers, cgroups, IRQs, hotplug, and security, you only stay sane by enforcing strong invariants and clear locking rules. The Linux scheduler does this relentlessly.
The core loop: __schedule
With the “control tower” analogy in mind, the first question is: what does the main decision loop look like? In Linux, that loop is __schedule. It’s called from schedule(), preemption paths, and block/yield sites, and its job is to:
- Decide whether the current task should stop running (block or keep going).
- Pick the next task for this CPU, possibly proxy-executing on behalf of a blocked owner.
- Perform the context switch while preserving scheduler invariants.
Here is a simplified but real excerpt:
```c
static void __sched notrace __schedule(int sched_mode)
{
struct task_struct *prev, *next;
bool preempt = sched_mode > SM_NONE;
unsigned long prev_state;
struct rq_flags rf;
struct rq *rq;
int cpu;
trace_sched_entry_tp(sched_mode == SM_PREEMPT);
cpu = smp_processor_id();
rq = cpu_rq(cpu);
prev = rq->curr;
local_irq_disable();
rcu_note_context_switch(preempt);
migrate_disable_switch(rq, prev);
rq_lock(rq, &rf);
smp_mb__after_spinlock();
update_rq_clock(rq);
preempt = sched_mode == SM_PREEMPT;
prev_state = READ_ONCE(prev->__state);
if (sched_mode == SM_IDLE) {
if (!rq->nr_running && !scx_enabled()) {
next = prev;
goto picked;
}
} else if (!preempt && prev_state) {
try_to_block_task(rq, prev, &prev_state,
!task_is_blocked(prev));
}
pick_again:
next = pick_next_task(rq, rq->donor, &rf);
rq_set_donor(rq, next);
if (unlikely(task_is_blocked(next))) {
next = find_proxy_task(rq, next, &rf);
if (!next)
goto pick_again;
if (next == rq->idle)
goto keep_resched;
}
picked:
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
/* context_switch() or stay on prev */
}
```
The structure illustrates the central design principle of this file: separate lifecycle decisions from selection decisions, and make each phase explicit.
1. Lifecycle first: “should we block?”
Before deciding who runs next, __schedule decides what happens to the current task. That is all about task state transitions and accounting:
- Moving from runnable to sleeping (or back).
- Maintaining `on_rq` and load statistics.
- Handling special modes like idle scheduling.
That logic is concentrated in helpers like try_to_block_task, which operate entirely within the “lifecycle” domain. Only after this phase does the scheduler move on to picking the next task.
Takeaway: Any time a function both mutates lifecycle state and performs a complex selection, split those concerns into clearly separated phases. Even in hot code, a static inline helper for lifecycle decisions makes correctness reviews much easier.
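To make that split concrete outside the kernel, here is a minimal user-space sketch. None of these names are kernel APIs; they are invented to show the shape: one helper owns the lifecycle decision, another owns selection, and the dispatch function only sequences the two.

```c
/* Toy dispatch loop illustrating the lifecycle/selection split.
 * All types and helpers are invented for illustration, not kernel code. */
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_RUNNABLE, TOY_SLEEPING };

struct toy_task {
    enum toy_state state;
    bool wants_to_sleep;      /* set by the task before it yields */
    struct toy_task *next;    /* simple FIFO run-queue link */
};

struct toy_rq {
    struct toy_task *head;    /* runnable tasks, FIFO order */
    struct toy_task *current;
};

/* Phase 1: lifecycle only - decide what happens to the current task. */
static void toy_try_to_block(struct toy_rq *rq)
{
    struct toy_task *cur = rq->current;

    if (cur && cur->wants_to_sleep) {
        cur->state = TOY_SLEEPING;   /* leaves the runnable set */
        rq->current = NULL;
    }
}

/* Phase 2: policy only - pick the next runnable task (trivial FIFO). */
static struct toy_task *toy_pick_next(struct toy_rq *rq)
{
    struct toy_task *next = rq->head;

    if (next)
        rq->head = next->next;
    return next;
}

/* The "schedule" entry point just sequences the two phases. */
void toy_schedule(struct toy_rq *rq)
{
    toy_try_to_block(rq);                      /* lifecycle first */
    struct toy_task *next = toy_pick_next(rq); /* policy second */

    if (next) {
        next->state = TOY_RUNNING;
        rq->current = next;                    /* a real switch would go here */
    }
}
```

The point is not the FIFO policy; it is that each phase can be reviewed, tested, and locked on its own terms.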
2. Policy second: pluggable “pick next task” strategy
Once lifecycle updates are done, __schedule calls into pick_next_task. That function is a meta-scheduler: it doesn’t know how CFS trees work or how RT priority queues are structured. It just orchestrates between scheduler classes via a small vtable.
```c
static inline struct task_struct *
__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
/* Fast path: only fair tasks runnable */
if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
rq->nr_running == rq->cfs.h_nr_queued)) {
p = pick_next_task_fair(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
if (!p) {
p = pick_task_idle(rq, rf);
put_prev_set_next_task(rq, prev, p);
}
return p;
}
restart:
prev_balance(rq, prev, rf);
for_each_active_class(class) {
if (class->pick_next_task) {
p = class->pick_next_task(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
if (p)
return p;
} else {
p = class->pick_task(rq, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
if (p) {
put_prev_set_next_task(rq, prev, p);
return p;
}
}
}
BUG(); /* idle class must always have something */
}
```
The core loop is simple:
- Fast path: if only CFS tasks are runnable, delegate directly to `pick_next_task_fair`.
- Otherwise, iterate over active scheduler classes in priority order, asking each one for a candidate.
- Handle special return values like `RETRY_TASK`, which indicate that balancing changed the picture and selection should restart.
Even at this level, the pattern is clear: lifecycle changes are contained, selection is delegated through a narrow interface, and the core control flow stays readable despite being performance-critical.
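If you want to lift this pattern into your own code, a sketch of the pluggable-class idea might look like the following. It is loosely inspired by struct sched_class, but every name here is invented and the policies are trivial stubs.

```c
/* Sketch of a pluggable "scheduler class" table, loosely inspired by
 * struct sched_class. All names and policies here are invented stubs. */
#include <stdio.h>
#include <stddef.h>

struct demo_rq;   /* per-CPU run queue; internals don't matter here */

struct demo_class {
    const char *name;
    /* Return a task id, or -1 if this class has nothing runnable. */
    int (*pick_next)(struct demo_rq *rq);
};

static int pick_rt(struct demo_rq *rq)   { (void)rq; return -1; } /* empty    */
static int pick_fair(struct demo_rq *rq) { (void)rq; return 42; } /* one task */
static int pick_idle(struct demo_rq *rq) { (void)rq; return 0;  } /* idle     */

/* Strict priority order; the idle class must always produce a candidate. */
static const struct demo_class demo_classes[] = {
    { "rt",   pick_rt   },
    { "fair", pick_fair },
    { "idle", pick_idle },
};

static int demo_pick_next_task(struct demo_rq *rq)
{
    for (size_t i = 0; i < sizeof(demo_classes) / sizeof(demo_classes[0]); i++) {
        int task = demo_classes[i].pick_next(rq);

        if (task >= 0) {
            printf("picked task %d from class %s\n", task, demo_classes[i].name);
            return task;
        }
    }
    return 0;  /* unreachable: idle always answers */
}

int main(void)
{
    demo_pick_next_task(NULL);  /* prints: picked task 42 from class fair */
    return 0;
}
```

The orchestrator’s only contract with a class is “give me a candidate or decline”; everything else (trees, priority queues, deadlines) stays behind that boundary.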
Waking tasks safely: try_to_wake_up
Choosing the next runnable task is only half the job. The other half is getting sleeping tasks back into the runnable set without violating invariants. That is the domain of try_to_wake_up, one of the most intricate functions in core.c.
If __schedule is the control tower, try_to_wake_up is the postal service routing wakeup “letters” to the right runqueue under heavy concurrency.
Fast path: waking a task that’s already runnable
Linux heavily optimizes the case where a task is already on a runqueue (for example, preempted but still runnable). Instead of fully re-enqueueing it, the kernel updates accounting and maybe preempts the current task. That logic lives in ttwu_runnable:
```c
static int ttwu_runnable(struct task_struct *p, int wake_flags)
{
struct rq_flags rf;
struct rq *rq;
int ret = 0;
rq = __task_rq_lock(p, &rf);
if (task_on_rq_queued(p)) {
update_rq_clock(rq);
if (p->se.sched_delayed)
enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
if (!task_on_cpu(rq, p))
wakeup_preempt(rq, p, wake_flags);
ttwu_do_wakeup(p);
ret = 1;
}
__task_rq_unlock(rq, p, &rf);
return ret;
}
```
The structure mirrors the lifecycle/selection split:
- Acquire the runqueue lock that owns `p` via `__task_rq_lock`.
- If `p` is already queued, update runqueue accounting and potentially re-enqueue delayed work.
- If `p` is not currently executing, consult policy (`wakeup_preempt`) to see if it should preempt the current task.
- Mark the lifecycle state as runnable (`ttwu_do_wakeup` writes `p->__state`) and unlock.
The heavy lifting is in how this fast path cooperates with the full try_to_wake_up path, which must preserve a tight state machine.
Rule of thumb: If your wakeup path shares state with your blocking path, design an explicit state machine with separate fields and documented transitions. Linux uses __state, on_rq, and on_cpu with comments and memory barriers instead of relying on implicit invariants.
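As a rough illustration (not kernel code, and with the memory-ordering subtleties deliberately omitted), an explicit state machine with separate fields might look like this in C11:

```c
/* Explicit wake/block state machine with separate fields, in the spirit of
 * __state / on_rq / on_cpu. Invented names; the real paths pair these writes
 * with carefully documented memory barriers, which are omitted here. */
#include <stdatomic.h>
#include <stdbool.h>

enum w_state { W_RUNNING, W_RUNNABLE, W_SLEEPING };

struct worker {
    _Atomic enum w_state state;  /* lifecycle: what the task is doing    */
    atomic_bool on_queue;        /* bookkeeping: is it in a run queue?   */
    atomic_bool on_cpu;          /* execution: is it physically running? */
};

/* Block path (runs on the worker's own CPU): RUNNING -> SLEEPING, dequeue.
 * on_cpu is cleared only after the context switch completes. */
void worker_block(struct worker *w)
{
    atomic_store(&w->state, W_SLEEPING);
    atomic_store(&w->on_queue, false);
}

/* Wake path (may run on any CPU): only a SLEEPING worker can be woken.
 * The CAS lets concurrent wakers race safely; exactly one re-queues it. */
bool worker_wake(struct worker *w)
{
    enum w_state expected = W_SLEEPING;

    if (!atomic_compare_exchange_strong(&w->state, &expected, W_RUNNABLE))
        return false;  /* already runnable or running: nothing to do */

    atomic_store(&w->on_queue, true);
    return true;
}
```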
Asynchronous wakeups via wakelists
Waking tasks on remote CPUs risks cross-CPU contention if you grab other CPUs’ runqueue locks directly. To avoid that in the hot path, the scheduler can enqueue a wakeup into a remote CPU’s wakelist and let that CPU process it under its own lock:
```c
static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
{
struct rq *rq = cpu_rq(cpu);
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
WRITE_ONCE(rq->ttwu_pending, 1);
#ifdef CONFIG_SMP
__smp_call_single_queue(cpu, &p->wake_entry.llist);
#endif
}
```
The remote CPU drains these entries in sched_ttwu_pending(), under its own rq lock. The net effect is:
- Wakeups are logically initiated by any CPU, but physically applied by the CPU that owns the runqueue.
- Callers never need to grab two runqueue locks at once in the common case.
For any sharded system—per-CPU runqueues, per-partition queues, distributed shards—this pattern is gold: ship work to the shard owner instead of mutating remote shard state synchronously.
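Here is a hedged sketch of that owner-applies-the-work pattern in user space. The kernel uses a lock-free llist plus an IPI to kick the remote CPU; this version substitutes a mutex-protected inbox and invented names to keep the idea visible.

```c
/* "Ship work to the shard owner": any thread may post a wakeup to a shard's
 * inbox, but only the owning thread applies it to the shard's run queue. */
#include <pthread.h>
#include <stdlib.h>

struct wakeup {
    int task_id;
    struct wakeup *next;
};

struct shard {
    pthread_mutex_t inbox_lock;   /* held only long enough to push or steal */
    struct wakeup *inbox;         /* pending wakeups: a tiny wakelist       */
    /* the shard's run queue, guarded by the owner's own lock, lives here */
};

/* Called from any thread: O(1) push, never touches the shard's run queue. */
int post_wakeup(struct shard *s, int task_id)
{
    struct wakeup *w = malloc(sizeof(*w));

    if (!w)
        return -1;
    w->task_id = task_id;

    pthread_mutex_lock(&s->inbox_lock);
    w->next = s->inbox;
    s->inbox = w;
    pthread_mutex_unlock(&s->inbox_lock);
    return 0;
}

/* Called only by the shard's owner (compare sched_ttwu_pending): steal the
 * whole list in one short critical section, then enqueue locally. */
void drain_wakeups(struct shard *s, void (*enqueue)(int task_id))
{
    pthread_mutex_lock(&s->inbox_lock);
    struct wakeup *list = s->inbox;
    s->inbox = NULL;
    pthread_mutex_unlock(&s->inbox_lock);

    while (list) {
        struct wakeup *next = list->next;

        enqueue(list->task_id);   /* run-queue mutation happens locally */
        free(list);
        list = next;
    }
}
```

The producers’ critical section touches only the inbox, never the run queue, so cross-shard contention stays tiny and the owner applies all run-queue mutations under its own lock.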
Sharing cores securely: core scheduling
On SMT systems, multiple logical CPUs share a physical core. That shared hardware can leak side channels when mutually untrusted tasks run concurrently on sibling threads. Linux’s core scheduling machinery in core.c treats a core as a single “stage” and uses cookies to decide which tasks are allowed to share it.
Conceptually:
- Each task may have a `core_cookie` (think of it as a color).
- Only tasks with the same cookie are allowed to run on sibling threads of the same core at the same time.
- If no matching cookie is available, the core may force idle an SMT sibling to preserve isolation.
Ordering by cookie and priority
Core scheduling maintains a per-runqueue RB-tree of cookie-tagged runnable tasks, ordered by cookie and by an internal priority value that squashes the rich class hierarchy into a single integer:
```c
/* kernel prio, less is more */
static inline int __task_prio(const struct task_struct *p)
{
if (p->sched_class == &stop_sched_class)
return -2;
if (p->dl_server)
return -1; /* deadline */
if (rt_or_dl_prio(p->prio))
return p->prio; /* [-1, 99] */
if (p->sched_class == &idle_sched_class)
return MAX_RT_PRIO + NICE_WIDTH; /* 140 */
if (task_on_scx(p))
return MAX_RT_PRIO + MAX_NICE + 1; /* 120, squash ext */
return MAX_RT_PRIO + MAX_NICE; /* 119, squash fair */
}
void sched_core_enqueue(struct rq *rq, struct task_struct *p)
{
if (p->se.sched_delayed)
return;
rq->core->core_task_seq++;
if (!p->core_cookie)
return;
rb_add(&p->core_node, &rq->core_tree, rb_sched_core_less);
}
```
This gives core scheduling a uniform way to compare tasks across classes (stop, deadline, RT, fair, idle, ext) while still honoring the policy encoded in each class. Again, lifecycle (enqueue/dequeue) is separate from policy (priority ordering and cookie matching).
Analogy: Cookies are colored wristbands. Only performers with the same color can share the stage. The RB-tree is the sorted waiting list, ordered first by color, then by “importance.”
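As a tiny standalone illustration of that ordering (the struct and function are invented, not the kernel’s rb_sched_core_less), the “less” comparison groups by cookie first and only then by squashed priority:

```c
/* Comparator with the same shape as the core-scheduling tree's ordering:
 * group by cookie first, then put more important tasks (lower squashed
 * priority) first. Invented struct; not the kernel's code. */
#include <stdbool.h>
#include <stdint.h>

struct tagged_task {
    uint64_t cookie;  /* "wristband color"; 0 means untagged */
    int prio;         /* squashed cross-class priority, less is more */
};

bool tagged_task_less(const struct tagged_task *a, const struct tagged_task *b)
{
    if (a->cookie != b->cookie)
        return a->cookie < b->cookie;  /* primary key: the cookie */
    return a->prio < b->prio;          /* within a cookie: importance */
}
```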
Locking runqueues with core scheduling enabled
Core scheduling complicates locking because multiple logical CPUs in a core can map to a shared underlying lock. Rather than exposing that everywhere, the scheduler uses indirection in the runqueue lock helpers:
```c
void raw_spin_rq_lock_nested(struct rq *rq, int subclass)
{
raw_spinlock_t *lock;
/* Matches synchronize_rcu() in __sched_core_enable() */
preempt_disable();
if (sched_core_disabled()) {
raw_spin_lock_nested(&rq->__lock, subclass);
preempt_enable_no_resched();
return;
}
for (;;) {
lock = __rq_lockp(rq);
raw_spin_lock_nested(lock, subclass);
if (likely(lock == __rq_lockp(rq))) {
preempt_enable_no_resched();
return;
}
raw_spin_unlock(lock);
}
}
```
The pattern is simple but powerful:
- Disable preemption so the lock pointer can’t change under our feet.
- Resolve the “real” spinlock for this runqueue with `__rq_lockp(rq)`.
- Take that lock and re-check that `__rq_lockp(rq)` still points to the same lock; if not, drop and retry.
This is another application of the central theme: keep policy and mapping logic behind helpers. Locking code doesn’t know about core scheduling details; it just calls into an indirection layer that can evolve without touching every call site.
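The same resolve-lock-recheck idiom works outside the kernel whenever the lock that covers an object can be re-pointed at runtime. A sketch with pthread mutexes and C11 atomics (invented names; the kernel additionally disables preemption and uses raw spinlocks):

```c
/* Resolve-lock-recheck for a lock that can be re-pointed at runtime. */
#include <pthread.h>
#include <stdatomic.h>

struct shared_queue {
    _Atomic(pthread_mutex_t *) lock_ptr;  /* which mutex currently covers us */
};

pthread_mutex_t *shared_queue_lock(struct shared_queue *q)
{
    for (;;) {
        pthread_mutex_t *lock = atomic_load(&q->lock_ptr);

        pthread_mutex_lock(lock);
        /* The mapping may have changed while we were waiting on the mutex. */
        if (lock == atomic_load(&q->lock_ptr))
            return lock;               /* still the owning lock: done */
        pthread_mutex_unlock(lock);    /* stale mapping: drop and retry */
    }
}
```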
Keeping the scheduler honest: ticks and metrics
All of this structure only matters if the system stays healthy under real load. The scheduler’s periodic tick and its exported metrics are how it keeps itself honest: they provide breathing room for maintenance and visibility into whether policies are working.
What the tick does: periodic maintenance and checks
The per-CPU timer tick, via sched_tick, is where the scheduler updates clocks, charges CPU time, evaluates preemption, and triggers rebalancing:
```c
void sched_tick(void)
{
int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu);
struct task_struct *donor;
struct rq_flags rf;
unsigned long hw_pressure;
u64 resched_latency;
if (housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE))
arch_scale_freq_tick();
sched_clock_tick();
rq_lock(rq, &rf);
donor = rq->donor;
psi_account_irqtime(rq, donor, NULL);
update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
resched_curr(rq);
donor->sched_class->task_tick(rq, donor, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
scx_tick(rq);
rq_unlock(rq, &rf);
if (sched_feat(LATENCY_WARN) && resched_latency)
resched_latency_warn(cpu, resched_latency);
perf_event_task_tick();
if (donor->flags & PF_WQ_WORKER)
wq_worker_tick(donor);
if (!scx_switched_all()) {
rq->idle_balance = idle_cpu(cpu);
sched_balance_trigger(rq);
}
}
```
Conceptually, the tick does three things:
- Accounting: update time, pressure, and load averages.
- Policy hooks: call into the current task’s scheduler class (`task_tick`), core scheduling (`sched_core_tick`), and extensions (`scx_tick`).
- Health checks: detect excessive reschedule latency and trigger rebalancing when needed.
Any high-throughput system needs a bounded-cost “maintenance loop” that checks invariants and nudges the system back into balance. Overusing it wastes cycles; underusing it lets skew and starvation grow. sched_tick is Linux’s carefully calibrated middle ground.
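A user-space analogue might look like the sketch below: constant-cost accounting, a pluggable policy hook, and one health check. The names and the 100 ms threshold are arbitrary choices for illustration, not anything from core.c.

```c
/* Bounded-cost maintenance tick for a user-space worker pool. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

struct pool {
    uint64_t queued;                      /* current backlog          */
    uint64_t last_dispatch_ns;            /* when work last started   */
    void (*policy_tick)(struct pool *p);  /* per-policy periodic work */
};

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

void pool_tick(struct pool *p)
{
    /* 1. Accounting: constant-time updates only. */
    uint64_t stall_ns = now_ns() - p->last_dispatch_ns;

    /* 2. Policy hook: let the active policy do its own periodic work. */
    if (p->policy_tick)
        p->policy_tick(p);

    /* 3. Health check: queued work but nothing dispatched lately? */
    if (p->queued > 0 && stall_ns > 100u * 1000 * 1000)
        fprintf(stderr, "tick: %llu tasks queued, stalled for %llu ms\n",
                (unsigned long long)p->queued,
                (unsigned long long)(stall_ns / 1000000));
}
```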
Metrics that reflect reality
The report underlying this walkthrough highlights several scheduler metrics that are directly useful in any sizable scheduler or queueing system:
| Metric | What it tells you |
|---|---|
| `scheduler_runqueue_length_per_cpu` | Per-CPU backlog and imbalance; long queues suggest overload or skewed work placement. |
| `context_switches_per_second` | Scheduling overhead; very high rates mean you’re thrashing between too many small tasks. |
| `wakeup_latency_histogram` | Time from wakeup to actually running; crucial for tail latency and interactive feel. |
| `cgroup_cpu_throttling_time` | How often CPU bandwidth limits are biting; spikes reveal misconfigured quotas. |
| `core_scheduling_forceidle_time` | Throughput cost of isolation; how much SMT capacity you’re giving up for security. |
Tip: When building your own scheduler or job system, start with metrics like queue length, context-switch (or dispatch) rates, wakeup latency, and throttling. They map directly to user-visible behavior and capacity planning.
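To make the tip concrete, here is a minimal metrics struct and a latency-recording helper; the bucket boundaries and field names are arbitrary, but the set matches the table above.

```c
/* Minimal metrics set: backlog, dispatch count, throttling, and a coarse
 * wakeup-latency histogram. Bucket bounds and names are arbitrary choices. */
#include <stdint.h>

#define LAT_BUCKETS 5   /* <1ms, <10ms, <100ms, <1s, >=1s */

struct sched_metrics {
    uint64_t queue_len;               /* sampled backlog per shard    */
    uint64_t dispatches;              /* bump once per context switch */
    uint64_t throttled_ns;            /* time work sat over its quota */
    uint64_t wakeup_lat[LAT_BUCKETS]; /* wakeup-to-running latency    */
};

void record_wakeup_latency(struct sched_metrics *m, uint64_t lat_ns)
{
    static const uint64_t bounds_ns[LAT_BUCKETS - 1] = {
        1000000ull, 10000000ull, 100000000ull, 1000000000ull
    };
    int b = 0;

    while (b < LAT_BUCKETS - 1 && lat_ns >= bounds_ns[b])
        b++;
    m->wakeup_lat[b]++;   /* the top buckets are the ones worth alerting on */
}
```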
Design lessons for your own systems
Walking through kernel/sched/core.c with one question—“how do we safely choose the next unit of work?”—reveals a set of design patterns that apply far beyond kernels. Here are the ones worth copying into your own schedulers, worker pools, and distributed queues.
1. Treat lifecycle and selection as separate phases
- Have a clear sequence: (1) update lifecycle state (blocked / runnable), (2) select the next runnable entity, (3) perform the switch.
- Even if they live in one hot function for performance, keep them as distinct conceptual phases with helpers like `try_to_block_task` and `pick_next_task`.
2. Use pluggable policies behind a narrow interface
- Expose a small vtable or interface per class/pool: `enqueue`, `dequeue`, `pick_next`, `task_tick`, etc.
- Let the core orchestrator manage ordering between classes without knowing their internals. That’s how Linux can add things like `sched_ext` without rewriting `__schedule`.
3. Make your state machine explicit
- Prefer several small flags with documented combinations over a single opaque enum.
- Linux’s trio of `__state`, `on_rq`, and `on_cpu` makes races around wakeup and block auditable, especially with comments and memory barriers.
4. Shard state and push work to the owner
- Per-CPU runqueues avoid global lock contention; distributed queues do the same at a larger scale.
- Wakelists and functions like `__ttwu_queue_wakelist` show how to route updates to the shard owner instead of synchronously mutating remote state.
5. Hide complex mappings behind helpers
- Core scheduling changes which physical spinlock protects a given runqueue, but most code only sees helpers like `raw_spin_rq_lock_nested`.
- Likewise, policy aggregation (cookies, clamps, quotas) is done in helpers and pre-processing, so the hot selection loop stays simple.
6. Instrument what the scheduler actually does
- Track queue lengths, dispatch/context-switch rates, and wakeup latency distributions.
- For multi-tenant systems, monitor throttling and forced idle time per tenant or isolation level.
- Use these signals to tune policies and quotas, not just to debug incidents.
7. Accept big hot paths, but make them navigable
- Functions like `__schedule` and `try_to_wake_up` will always be complex because they sit at the intersection of many constraints.
- Linux compensates with disciplined naming (`enqueue`/`dequeue`, `ttwu_*`, `rq_lock`), heavy commenting of invariants, and small helpers that encapsulate sub-steps.
The goal isn’t tiny functions everywhere; it’s large but understandable hot paths whose invariants are explicit and whose evolution is manageable.
The Linux scheduler’s core file is intimidating at first: thousands of lines, interactions with almost every subsystem, and lock diagrams that span multiple screens. But once you follow its main question—“which task runs next?”—the structure becomes clear: lifecycle and selection are distinct phases, policies are pluggable, state machines are explicit, sharded state is respected, and periodic maintenance plus metrics keep it honest.
Whether you’re building a kernel scheduler, a distributed job runner, or a background worker pool, the same patterns apply. Separate lifecycle from selection, hide policy behind narrow interfaces, make invariants explicit, shard state and ship work to its owner, and instrument what matters. That’s how Linux chooses your next CPU time slice, and it’s a design you can reuse far beyond the kernel.