Naufal Wiwit Putra

I Put an LLM Inside the Linux Kernel Scheduler. Here's What Happened.

A few weeks ago, I did something that probably shouldn't work: I replaced the CPU scheduling algorithm in my Linux kernel with calls to an AI model. As on-device LLM inference gets faster, I was curious whether it could work as a CPU scheduler.

Maybe in the future, tweaking a laptop's performance is a matter of adjusting the system prompt 🤷‍♂️

What Is a CPU Scheduler?

A CPU scheduler is the operating system component that decides which task or process gets to use the CPU at any given moment.

Linux's default scheduler is called CFS (Completely Fair Scheduler). It's an algorithm that tries to give every process a fair share of CPU time, weighted by priority, and it makes its decisions in microseconds, fully algorithmically.
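To make "fair share, weighted by priority" concrete, here is a toy sketch of CFS's core bookkeeping. This is not kernel code, and the constants and names are illustrative: each task accumulates "virtual runtime" scaled inversely by its weight, and the task with the smallest vruntime runs next.

```rust
// Toy illustration of CFS's core idea (not kernel code).
const NICE_0_WEIGHT: u64 = 1024;

struct Task {
    name: &'static str,
    weight: u64,   // higher weight = higher priority
    vruntime: u64, // weighted CPU time consumed so far
}

impl Task {
    // Charge `delta_exec` ns of real CPU time: a heavier task's
    // virtual clock advances slower, so it gets picked more often.
    fn charge(&mut self, delta_exec: u64) {
        self.vruntime += delta_exec * NICE_0_WEIGHT / self.weight;
    }
}

// The scheduler always picks the runnable task with the smallest vruntime.
fn pick_next(tasks: &[Task]) -> &Task {
    tasks.iter().min_by_key(|t| t.vruntime).unwrap()
}

fn main() {
    let mut tasks = vec![
        Task { name: "firefox", weight: 2048, vruntime: 0 },
        Task { name: "updatedb", weight: 512, vruntime: 0 },
    ];
    tasks[0].charge(1_000_000);
    tasks[1].charge(1_000_000);
    // updatedb's vruntime grew 4x faster, so firefox runs next.
    println!("next: {}", pick_next(&tasks).name);
}
```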

The Idea

Two things made this feel worth trying.

First, sched_ext landed in Linux 6.12. It's a framework that lets you write custom CPU schedulers as eBPF programs and load them into a running kernel without patching or rebooting. If you want to dive deeper into how sched_ext works, this video is a great starting point.

Second, there is a project called ananicy that improves scheduling by setting the nice value of processes based on a static catalog that maps process names to priority levels. It works well, but the catalog has to be manually maintained. Someone has to decide that firefox gets a higher priority than updatedb.
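Ananicy's approach boils down to a static lookup table. A minimal sketch of the idea, where the specific entries are my own illustration rather than ananicy's actual rules:

```rust
use std::collections::HashMap;

// A minimal sketch of ananicy's approach: a static, hand-maintained
// catalog mapping process names to nice values. Entries are illustrative.
fn static_catalog() -> HashMap<&'static str, i8> {
    HashMap::from([
        ("firefox", -5),  // interactive: boost priority
        ("updatedb", 10), // background indexer: deprioritize
        ("make", 5),
    ])
}

fn nice_for(name: &str, catalog: &HashMap<&str, i8>) -> i8 {
    // Unknown processes fall back to the default nice value of 0.
    *catalog.get(name).unwrap_or(&0)
}

fn main() {
    let catalog = static_catalog();
    println!("firefox -> nice {}", nice_for("firefox", &catalog));
    println!("python  -> nice {}", nice_for("python", &catalog));
}
```

The catch, as noted above, is that every entry in this map is a human decision that has to be maintained by hand.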

The question I wanted to answer: could an LLM do what ananicy does, but dynamically? Could it look at a process and make a smarter decision?

If it could, would it matter?

Version 1: Naive Implementation

The first version was simple and absurd. For every task that needed scheduling, I sent a request to the Gemini API with the task's PID, weight, current CPU, name, and number of available CPUs, and I asked it to return the best CPU to schedule the task on.

```rust
let prompt = format!(
    "You are a CPU scheduler. Given a task with name={}, pid={}, weight={}, \
     current_cpu={}, and {} available CPUs (numbered 0 to {}), \
     respond with ONLY a single number representing the best CPU \
     to schedule this task on. Just the number, nothing else.",
    name, pid, weight, current_cpu, num_cpus, num_cpus - 1
);
```

This meant every context switch now involved:

  • Serializing task metadata
  • Making an HTTPS request to Google's API
  • Waiting for inference
  • Parsing the response
  • Dispatching the task to a CPU
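The weakest link in that chain is the parsing step: the model is asked for a bare number, but nothing guarantees it returns one. A defensive sketch of that step (the fallback-to-current-CPU policy here is my assumption, not necessarily what the original code does):

```rust
// Parse the model's reply into a CPU index. The prompt asks for a bare
// number, but models sometimes return extra text or out-of-range values,
// so we trim, parse, and bounds-check, falling back to the task's
// current CPU when the reply is unusable (an assumed policy).
fn parse_cpu_reply(reply: &str, num_cpus: i32, current_cpu: i32) -> i32 {
    match reply.trim().parse::<i32>() {
        Ok(cpu) if (0..num_cpus).contains(&cpu) => cpu,
        _ => current_cpu, // garbage or out-of-range: keep the task where it is
    }
}

fn main() {
    println!("{}", parse_cpu_reply("3\n", 8, 0));   // clean reply
    println!("{}", parse_cpu_reply("CPU 3", 8, 1)); // chatty reply: fallback
}
```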

For the first time in the history of computer science, the context switching cost was literally a cost in dollars 🤣.

This approach ended with the kernel killing my scheduler via the sched_ext watchdog, and the Gemini API returning rate-limit-exceeded errors.

Version 2: Local Inference

The obvious fix was to move to local inference. I swapped the Gemini API for Google's newly released Gemma 4, running unsloth/gemma-4-e2b-it in LM Studio on my host machine, accessed over VMware's virtual network.

Local Inference Architecture

```rust
const LM_STUDIO_URL: &str = "http://192.168.218.1:1234/v1/chat/completions";
```

LM Studio exposes an OpenAI-compatible API, so the migration was mostly a matter of changing the endpoint. I also moved from per-task requests to batched requests: I collect all queued tasks, make one API call per batch, and get back a JSON array of {pid, cpu} assignments.
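Batching changes the prompt from "one task, one number" to "all queued tasks, one JSON array". A sketch of how such a batched prompt can be assembled; the struct and exact wording are my own illustration, not the repo's code:

```rust
// Hypothetical queued-task record, mirroring the fields sent per task
// in Version 1 (pid, name, weight, current CPU).
struct QueuedTask {
    pid: i32,
    name: String,
    weight: u64,
    current_cpu: i32,
}

// Serialize all queued tasks into one prompt and ask for a JSON array
// of {pid, cpu} assignments, so the whole batch costs a single request.
fn batched_prompt(tasks: &[QueuedTask], num_cpus: i32) -> String {
    let lines: Vec<String> = tasks
        .iter()
        .map(|t| {
            format!(
                "pid={} name={} weight={} cpu={}",
                t.pid, t.name, t.weight, t.current_cpu
            )
        })
        .collect();
    format!(
        "You are a CPU scheduler with {} CPUs (0 to {}). Tasks:\n{}\n\
         Respond with ONLY a JSON array of {{\"pid\": <pid>, \"cpu\": <cpu>}} objects.",
        num_cpus,
        num_cpus - 1,
        lines.join("\n")
    )
}
```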

```rust
let assignments: HashMap<i32, i32> = self.rt
    .block_on(get_ai_assignments(&self.client, &tasks, num_cpu))
    .unwrap_or_default()
    .into_iter()
    .map(|a| (a.pid, a.cpu))
    .collect();
```

The result was faster than Version 1, but the fundamental problem remained: the scheduler was still blocking on every dispatch cycle, waiting for the model to respond. Even though latency dropped to under 1 second (~500 ms per request), the kernel still terminated my scheduler.

Version 3: Moving AI Off the Critical Path

The real architectural insight was this: the AI should never be on the critical path.

Instead of wait for AI -> dispatch, the flow becomes dispatch immediately -> ask AI -> cache for the next cycle.

Cached AI Scheduler Architecture

So on each cycle:

  1. Dispatch tasks immediately using the cache (falling back to the algorithm on a cache miss)
  2. Call notify_complete, so the kernel is happy
  3. Only then fire the AI request
  4. Update the cache with the result

Instant dispatch with fallback to algorithm:

```rust
let cpu = match self.cache.get(&top_task.task.pid) {
    Some(&cached_cpu) => cached_cpu,  // instant, no waiting
    None => self.bpf.select_cpu(...), // algorithmic fallback
};
```

And the AI update happened after we notified the kernel:

```rust
self.bpf.notify_complete(0);

if let Ok(Some(assignments)) = ai_result {
    for a in assignments {
        self.cache.insert(a.pid, a.cpu);
    }
}
```
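Putting both halves together, one dispatch cycle can be sketched end to end. This is a simplification: the fallback and the AI call are passed in as closures, whereas the real scheduler talks to the kernel through sched_ext's BPF interface.

```rust
use std::collections::HashMap;

// Simplified sketch of one Version 3 dispatch cycle:
// dispatch from cache first, ask the AI afterwards, and keep its
// answers for the *next* cycle. Function shape is my own, not the repo's.
fn dispatch_cycle(
    cache: &mut HashMap<i32, i32>,
    queued: &[i32], // pids waiting to run
    select_cpu_fallback: impl Fn(i32) -> i32, // kernel-style algorithmic pick
    ai_assignments: impl Fn(&[i32]) -> Vec<(i32, i32)>, // fired after dispatch
) -> Vec<(i32, i32)> {
    // 1. Dispatch immediately: cached decision, or algorithmic fallback.
    let dispatched: Vec<(i32, i32)> = queued
        .iter()
        .map(|&pid| (pid, *cache.get(&pid).unwrap_or(&select_cpu_fallback(pid))))
        .collect();
    // 2. notify_complete would be called here; the kernel is unblocked.
    // 3+4. Only now ask the AI, caching its answers for the next cycle.
    for (pid, cpu) in ai_assignments(queued) {
        cache.insert(pid, cpu);
    }
    dispatched
}

fn main() {
    let mut cache = HashMap::new();
    // First cycle: cache miss, algorithmic fallback picks CPU 0.
    let first = dispatch_cycle(&mut cache, &[1], |_| 0, |_: &[i32]| vec![(1, 3)]);
    // Second cycle: the AI's earlier answer (CPU 3) is served from cache.
    let second = dispatch_cycle(&mut cache, &[1], |_| 0, |_: &[i32]| vec![(1, 3)]);
    println!("{:?} then {:?}", first, second);
}
```

The key property is visible in the two calls: the AI's decision only ever takes effect one cycle late, which is exactly why it can be slow without stalling the kernel.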

Result

I benchmarked all three configurations against the CFS baseline using schbench:

|                         | CFS        | Blocking AI  | Cache AI     |
|-------------------------|------------|--------------|--------------|
| Wakeup 99th percentile  | 3,068 µs   | 1,939,456 µs | 201,984 µs   |
| Request 99th percentile | 243,968 µs | 7,708,672 µs | 2,009,088 µs |
| Median RPS              | 102        | 6            | 41           |

From that data, a few things jump out:

The blocking version is catastrophic. 99th percentile request latency of 7.7 seconds. RPS dropped from 102 to 6.

The cache version is a genuine improvement. Moving AI off the critical path increased RPS from 6 to 41, almost a 7x improvement. Wakeup 99th dropped from 1.9 seconds to 203 milliseconds. Still much worse than CFS, but the trend is real and measurable.

The fast path is never the AI. When the cache is warm, dispatch latency drops to 1-2 µs. But that's a HashMap lookup, not inference. The model made that decision in a previous cycle and stored it. Every fast dispatch is a case where the AI wasn't involved in the current cycle at all.

The best version of this scheduler is one where the AI runs as rarely as possible and the cache does the work. That raises an honest question: is the AI adding anything over just using the kernel's select_cpu every time?

Conclusion

The numbers show that putting an LLM into the kernel scheduler makes everything worse, but the improvement from v1 to v3 is real. At current inference speeds, putting AI on the critical path of CPU scheduling is not a good idea.

So AI obviously cannot beat CFS and other algorithmic schedulers today. But maybe it can in the future, when on-device inference is 100x faster. This project is evidence of what needs to be solved to get there.

Code is at https://github.com/naufalw/scheduler-exp
