<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Naufal Wiwit Putra</title>
    <description>The latest articles on DEV Community by Naufal Wiwit Putra (@naufalw).</description>
    <link>https://dev.to/naufalw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1069863%2Fd15af11a-86ef-4f00-91c1-5d77d5ad66d0.png</url>
      <title>DEV Community: Naufal Wiwit Putra</title>
      <link>https://dev.to/naufalw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/naufalw"/>
    <language>en</language>
    <item>
      <title>I Put an LLM Inside the Linux Kernel Scheduler. Here's What Happened.</title>
      <dc:creator>Naufal Wiwit Putra</dc:creator>
      <pubDate>Sun, 05 Apr 2026 06:01:54 +0000</pubDate>
      <link>https://dev.to/naufalw/i-put-an-llm-inside-the-linux-kernel-scheduler-heres-what-happened-1cn9</link>
      <guid>https://dev.to/naufalw/i-put-an-llm-inside-the-linux-kernel-scheduler-heres-what-happened-1cn9</guid>
      <description>&lt;p&gt;A few weeks ago, I did something that probably shouldn't work. I replaced the CPU scheduling algorithm in my Linux kernel with calls to an AI model. As on-device LLM inference capabilities grow, I am curious about its potential as a CPU scheduler. &lt;/p&gt;

&lt;p&gt;Maybe in the future, tweaking a laptop's performance is a matter of adjusting the system prompt 🤷‍♂️&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a CPU Scheduler?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.geeksforgeeks.org/operating-systems/cpu-scheduling-in-operating-systems/" rel="noopener noreferrer"&gt;CPU Scheduler&lt;/a&gt; is an operating system component that decides which task or process gets to use the CPU at a particular time. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Linux's default scheduler has long been CFS (Completely Fair Scheduler); since kernel 6.6 it has been superseded by EEVDF, which keeps the same fair-share idea. It's an algorithm that tries to give every process a fair share of CPU time, weighted by priority, and it makes its decisions in microseconds, fully algorithmically.&lt;/p&gt;
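
&lt;p&gt;For intuition, the heart of fair scheduling is tracking how long each task has run and always picking the one that has run the least. A heavily simplified sketch (real CFS keeps priority-weighted virtual runtimes in a red-black tree):&lt;/p&gt;

```rust
fn main() {
    // (pid, virtual runtime in ns). A fair scheduler picks the task
    // that has accumulated the least CPU time so far.
    let tasks = vec![(101, 4_000_000u64), (102, 1_500_000), (103, 9_000_000)];
    let next = tasks.iter().min_by_key(|&&(_, vr)| vr).map(|&(pid, _)| pid);
    assert_eq!(next, Some(102)); // pid 102 has run the least, so it goes next
    println!("next task: pid {:?}", next);
}
```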

&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Two things made this feel worth trying.&lt;/p&gt;

&lt;p&gt;First, &lt;a href="https://sched-ext.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;sched_ext&lt;/strong&gt;&lt;/a&gt; landed in Linux 6.12. It's a framework that lets you write custom CPU schedulers as eBPF programs and load them into a running kernel without patching or rebooting. If you want to dive deeper into how sched_ext works, &lt;a href="https://www.youtube.com/watch?v=MXejs4KGAro" rel="noopener noreferrer"&gt;this video&lt;/a&gt; is a great starting point.&lt;/p&gt;

&lt;p&gt;Second, there is a project called &lt;a href="https://github.com/nefelim4ag/Ananicy" rel="noopener noreferrer"&gt;&lt;strong&gt;ananicy&lt;/strong&gt;&lt;/a&gt; that improves scheduling by setting the &lt;code&gt;nice&lt;/code&gt; value of processes based on a static catalog that maps process names to priority levels. It works well, but the catalog has to be manually maintained. Someone has to decide that &lt;code&gt;firefox&lt;/code&gt; gets a higher priority than &lt;code&gt;updatedb&lt;/code&gt;.&lt;/p&gt;
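
&lt;p&gt;In essence, a static catalog like ananicy's boils down to a name-to-priority lookup table. An illustrative sketch of the idea (this is not ananicy's actual rule syntax):&lt;/p&gt;

```rust
use std::collections::HashMap;

fn main() {
    // Illustrative static catalog: process name -> nice value
    // (lower nice = higher priority). Not ananicy's real rule format.
    let catalog: HashMap<&str, i32> =
        [("firefox", -5), ("updatedb", 19), ("make", 10)].into();

    // Unknown processes keep the default priority (nice 0).
    let nice = |name: &str| *catalog.get(name).unwrap_or(&0);
    assert_eq!(nice("firefox"), -5);
    assert_eq!(nice("some-unknown-proc"), 0);
    println!("firefox nice = {}", nice("firefox"));
}
```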

&lt;p&gt;The question I wanted to answer: could an LLM do what ananicy does, but dynamically? Could it look at a process and make a smarter decision? &lt;/p&gt;

&lt;p&gt;If it could, would it matter? &lt;/p&gt;

&lt;h2&gt;
  
  
  Version 1: Naive Implementation
&lt;/h2&gt;

&lt;p&gt;The first version was simple and absurd. For every task that needed scheduling, I sent a request to the Gemini API with the task's PID, weight, current CPU, name, and number of available CPUs, and I asked it to return the best CPU to schedule the task on.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"You are a CPU scheduler. Given a task with name={}, pid={}, weight={}, &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     current_cpu={}, and {} available CPUs (numbered 0 to {}), &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     respond with ONLY a single number representing the best CPU &lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;
     to schedule this task on. Just the number, nothing else."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_cpus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_cpus&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This meant every scheduling decision now involved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serializing task metadata&lt;/li&gt;
&lt;li&gt;Making an HTTPS request to Google's API&lt;/li&gt;
&lt;li&gt;Waiting for inference&lt;/li&gt;
&lt;li&gt;Parsing the response&lt;/li&gt;
&lt;li&gt;Dispatching the task to a CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;For the first time in the history of computer science, the context switching cost was literally a cost in dollars 🤣.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This approach ended with the kernel's watchdog killing my scheduler and the Gemini API returning rate-limit-exceeded errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version 2: Local Inference
&lt;/h2&gt;

&lt;p&gt;The obvious fix was to move to local inference. I swapped Gemini for Google's newly released Gemma 4, running &lt;code&gt;unsloth/gemma-4-e2b-it&lt;/code&gt; in LM Studio on my host machine and accessing it over VMware's virtual network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw1ynpokphnydf3kq4wb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdw1ynpokphnydf3kq4wb.png" alt="Local Inference Architecture" width="800" height="587"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;LM_STUDIO_URL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://192.168.218.1:1234/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LM Studio exposes an OpenAI-compatible API, so the migration was mostly a matter of changing the endpoint. I also moved from per-task requests to batched requests: collect all queued tasks, make one API call per batch, and get back a JSON array of &lt;code&gt;{pid, cpu}&lt;/code&gt; assignments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;assignments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.rt&lt;/span&gt;
    &lt;span class="nf"&gt;.block_on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_ai_assignments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_cpu&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;.unwrap_or_default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="py"&gt;.pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="py"&gt;.cpu&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
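

&lt;p&gt;The model's reply also has to be parsed defensively, since nothing guarantees well-formed output. The article doesn't show that parsing; a toy stdlib-only extractor for replies shaped like &lt;code&gt;[{"pid":123,"cpu":2}, ...]&lt;/code&gt; might look like this (a real implementation would use a proper JSON parser and validate CPU indices):&lt;/p&gt;

```rust
// Toy extractor for model replies like: [{"pid":123,"cpu":2},{"pid":456,"cpu":0}]
// Illustrative only; a real implementation would use a JSON parser
// (e.g. serde_json) and validate that each cpu index is in range.
fn extract_num(chunk: &str, key: &str) -> Option<i32> {
    let start = chunk.find(key)? + key.len();
    let rest = &chunk[start..];
    let end = rest
        .find(|c: char| !c.is_ascii_digit() && c != '-')
        .unwrap_or(rest.len());
    rest[..end].parse().ok()
}

fn parse_assignments(reply: &str) -> Vec<(i32, i32)> {
    reply
        .split('{')
        .skip(1) // everything before the first '{' is not an object
        .filter_map(|obj| Some((extract_num(obj, "\"pid\":")?, extract_num(obj, "\"cpu\":")?)))
        .collect()
}

fn main() {
    let reply = r#"[{"pid":123,"cpu":2},{"pid":456,"cpu":0}]"#;
    assert_eq!(parse_assignments(reply), vec![(123, 2), (456, 0)]);
    // Garbage in -> empty result, so the scheduler keeps its old cache.
    assert_eq!(parse_assignments("sorry, I cannot do that"), vec![]);
    println!("{:?}", parse_assignments(reply));
}
```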



&lt;p&gt;The result was faster than Version 1, but the fundamental problem remained: the scheduler still blocked on every dispatch cycle, waiting for the model to respond. Even though the latency dropped to under a second (~500 ms per request), the kernel still terminated my scheduler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Version 3: Moving AI Off the Critical Path
&lt;/h2&gt;

&lt;p&gt;The real architectural insight was this: the AI should never be on the critical path.&lt;/p&gt;

&lt;p&gt;Instead of &lt;em&gt;wait for AI -&amp;gt; dispatch&lt;/em&gt;, the flow becomes &lt;em&gt;dispatch immediately -&amp;gt; ask AI -&amp;gt; cache the answer for the next cycle&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkr8s00jxrc3y136mva6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkr8s00jxrc3y136mva6.png" alt="Cached AI Scheduler Architecture" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So on each cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dispatch the task immediately using the cache (falling back to the algorithm on a miss)&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;notify_complete&lt;/code&gt;; the kernel is happy&lt;/li&gt;
&lt;li&gt;Only then fire off the AI request&lt;/li&gt;
&lt;li&gt;Update the cache with the result&lt;/li&gt;
&lt;/ol&gt;
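
&lt;p&gt;The steps above can be sketched with a plain channel and a worker thread standing in for the inference call (illustrative; the real code blocks on an async runtime via &lt;code&gt;self.rt.block_on&lt;/code&gt; rather than using std threads):&lt;/p&gt;

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

fn main() {
    let mut cache: HashMap<i32, i32> = HashMap::new();
    let (tx, rx) = mpsc::channel::<Vec<(i32, i32)>>();

    // Background worker stands in for the AI call: it computes
    // (pid, cpu) assignments off the critical path and sends them back.
    thread::spawn(move || {
        let assignments = vec![(101, 2), (102, 0)]; // pretend model output
        tx.send(assignments).unwrap();
    });

    // Dispatch path: use the cache if warm; 0 stands in for the
    // algorithmic fallback on a miss.
    let pick = |cache: &HashMap<i32, i32>, pid: i32| *cache.get(&pid).unwrap_or(&0);
    assert_eq!(pick(&cache, 101), 0); // cold cache: fallback

    // A later cycle drains the worker's result into the cache.
    for (pid, cpu) in rx.recv().unwrap() {
        cache.insert(pid, cpu);
    }
    assert_eq!(pick(&cache, 101), 2); // warm cache: instant answer
    println!("pid 101 -> cpu {}", pick(&cache, 101));
}
```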

&lt;p&gt;Instant dispatch with fallback to algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;cpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cache&lt;/span&gt;&lt;span class="nf"&gt;.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;top_task&lt;/span&gt;&lt;span class="py"&gt;.task.pid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cached_cpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;cached_cpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// instant, no waiting&lt;/span&gt;
    &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.bpf&lt;/span&gt;&lt;span class="nf"&gt;.select_cpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// algorithmic fallback&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the AI update happened after we notified the kernel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.bpf&lt;/span&gt;&lt;span class="nf"&gt;.notify_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;assignments&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_result&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;assignments&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.cache&lt;/span&gt;&lt;span class="nf"&gt;.insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="py"&gt;.pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="py"&gt;.cpu&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;I benchmarked all three configurations against the CFS baseline using &lt;code&gt;schbench&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CFS&lt;/th&gt;
&lt;th&gt;Blocking AI&lt;/th&gt;
&lt;th&gt;Cache AI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wakeup 99th percentile&lt;/td&gt;
&lt;td&gt;3,068 µs&lt;/td&gt;
&lt;td&gt;1,939,456 µs&lt;/td&gt;
&lt;td&gt;201,984 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request 99th percentile&lt;/td&gt;
&lt;td&gt;243,968 µs&lt;/td&gt;
&lt;td&gt;7,708,672 µs&lt;/td&gt;
&lt;td&gt;2,009,088 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median RPS&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;41&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From that data, a few things jump out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The blocking version is catastrophic.&lt;/strong&gt; 99th percentile request latency of 7.7 seconds. RPS dropped from 102 to 6.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cache version is a genuine improvement.&lt;/strong&gt; Moving the AI off the critical path increased RPS from 6 to 41, almost a 7x improvement, and 99th-percentile wakeup latency dropped from 1.9 seconds to about 202 milliseconds. Still much worse than CFS, but the trend is real and measurable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fast path is never the AI.&lt;/strong&gt; When the cache is warm, dispatch latency drops to 1-2 µs. But that's a HashMap lookup, not inference. The model made that decision in a previous cycle and stored it. Every fast dispatch is a case where the AI wasn't involved in the current cycle at all. &lt;/p&gt;

&lt;p&gt;The best version of this scheduler is one where the AI runs as rarely as possible and the cache does the work. That raises an honest question: is the AI adding anything over just calling the kernel's &lt;code&gt;select_cpu&lt;/code&gt; every time?&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The numbers show that putting an LLM into the kernel scheduler makes everything worse, but the improvement from v1 to v3 is real. At current inference speeds, putting AI on the critical path of CPU scheduling is simply not viable. &lt;/p&gt;

&lt;p&gt;So, for now, AI clearly cannot beat CFS and other algorithm-based schedulers. But maybe in the future, when on-device inference is 100x faster, it can. This project is a record of what needs to be solved to get there.&lt;/p&gt;

&lt;p&gt;Code is at &lt;a href="https://github.com/naufalw/scheduler-exp" rel="noopener noreferrer"&gt;https://github.com/naufalw/scheduler-exp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>linux</category>
      <category>rust</category>
    </item>
  </channel>
</rss>
