GPU inference often looks like a GPU problem, but the CPU still sits on the critical path. It prepares inputs, launches CUDA kernels, manages synchronization, handles runtime calls, and shares cores with system work, interrupts, and other tenants. If that CPU-side launch path is delayed, the GPU can be left waiting even when the GPU kernels themselves are fast.
This post asks a concrete question: when an LLM inference workload is running on a GPU, how much do Linux CPU scheduling decisions and IRQ handling actually matter?
To answer it, we built an eBPF tracing tool, cuda_sched_trace, that records CUDA kernel launches, scheduler context switches, and hard/soft IRQ events with nanosecond timestamps. We then ran Qwen3 0.6B inference under clean and noisy-neighbor conditions: CPU load from stress-ng, network load from iperf3, disk load from fio, a combined heavy-load case, and a mitigation case using CPU pinning and priority adjustment.
The short version: in a clean environment, scheduler and IRQ overhead are small. Under production-like noisy-neighbor conditions, they can become very real. Combined CPU, network, and disk interference reduced throughput by 20.5%, while simple CPU pinning reduced context switches by 96.3% and recovered most of the lost throughput.
Why CPU Scheduling Shows Up in GPU Inference
Modern GPU workloads, particularly LLM inference and training, require tight coordination between CPU and GPU execution. The CPU is responsible for:
- preparing input data and kernel parameters
- launching GPU kernels through CUDA APIs
- managing memory transfers and synchronization
An interruption to that CPU-side workflow can delay GPU kernel submission. In the worst case, the GPU has available compute capacity but no new work to execute.
The motivation comes partly from Meta's work on sched_ext for AI training optimization, where production issues include "IRQs preempting our important tasks." Network interrupts (NET_RX/NET_TX) and block device interrupts can matter for large distributed training jobs, and custom scheduling policies can improve AI workload performance by 5-20%.
But the impact is workload-dependent. A single-node LLM inference loop is not the same as distributed training with all-reduce traffic. Before investing in custom scheduling, we wanted measurements that separate scheduler problems from normal application behavior.
The study has four goals:
- Measure the baseline impact of CPU scheduling on GPU kernel launches.
- Characterize IRQ interference patterns and their performance cost.
- Quantify noisy-neighbor impact under CPU, network, disk, and combined load.
- Evaluate how much CPU pinning and priority adjustment help.
Tracing the Launch Path
We developed cuda_sched_trace, an eBPF-based tracing tool that combines CUDA API uprobes, Linux scheduler tracepoints, and IRQ tracepoints.
CUDA API Tracing
The tool attaches uprobes to CUDA Driver and Runtime APIs:
// Attach to CUDA Driver API
SEC("uprobe/cuLaunchKernel")
int trace_cuLaunchKernel(struct pt_regs *ctx) {
// Capture: timestamp, pid, tid, grid/block dimensions, shared memory, stream
// Mark process as GPU process for scheduler tracking
}
// Attach to CUDA Runtime API
SEC("uprobe/cudaLaunchKernel")
int trace_cudaLaunchKernel(struct pt_regs *ctx) { ... }
SEC("uprobe/cudaDeviceSynchronize")
int trace_cudaDeviceSynchronize_enter(struct pt_regs *ctx) { ... }
SEC("uretprobe/cudaDeviceSynchronize")
int trace_cudaDeviceSynchronize_exit(struct pt_regs *ctx) { ... }
Scheduler Event Tracing
Scheduler activity is captured through sched_switch, filtered to GPU-related processes:
SEC("tp_btf/sched_switch")
int BPF_PROG(sched_switch, bool preempt, struct task_struct *prev, struct task_struct *next) {
// Only track if prev or next is a GPU process
// Record: timestamp, prev/next pid, off-cpu/on-cpu duration
}
IRQ Tracing
Hard and soft IRQs are tracked through kernel tracepoints:
SEC("tp_btf/irq_handler_entry")
int BPF_PROG(irq_handler_entry, int irq, struct irqaction *action) {
// Track hard IRQ entry, record IRQ number and handler name
}
SEC("tp_btf/irq_handler_exit")
int BPF_PROG(irq_handler_exit, int irq, struct irqaction *action) {
// Calculate IRQ duration
}
SEC("tp_btf/softirq_entry")
int BPF_PROG(softirq_entry, unsigned int vec_nr) {
// Track soft IRQ: TIMER, NET_RX, NET_TX, BLOCK, SCHED, RCU, etc.
}
SEC("tp_btf/softirq_exit")
int BPF_PROG(softirq_exit, unsigned int vec_nr) {
// Calculate soft IRQ duration
}
The data path is straightforward: the GPU application issues CUDA calls; eBPF programs observe CUDA, scheduler, and IRQ events in kernel space; events are sent through a BPF ring buffer; analysis scripts parse the resulting CSV.
┌─────────────────────────────────────────────────────────────────┐
│ User Space │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ GPU App │ │ cuda_sched │ │ Analysis Scripts │ │
│ │ (qwen3.cu) │ │ _trace │ │ (Python) │ │
│ └──────┬──────┘ └──────┬──────┘ └──────────┬──────────┘ │
│ │ │ │ │
│ │ CUDA calls │ perf_event │ CSV parsing │
│ ▼ ▼ ▼ │
├─────────────────────────────────────────────────────────────────┤
│ Kernel Space │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ uprobes │ │ tracepoints │ │ BPF Ring Buffer │ │
│ │ (CUDA API) │ │ (sched,irq) │ │ (Event Queue) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Benchmark and Environment
The benchmark is Qwen3 0.6B LLM inference using qwen3.cu.
| Property | Value |
|---|---|
| Model | Qwen3-0.6B-FP32 |
| Task | Single-turn Q&A |
| Input | "What is eBPF?" |
| Output | ~30-50 tokens |
| Kernel Pattern | Burst submission (~950 launches per token) |
| GPU Memory | ~3 GB |
This benchmark is useful because it resembles modern LLM inference, mixes compute-bound and memory-bound kernels, shows a clear burst submission pattern, and produces a measurable throughput metric in tokens per second.
| Component | Specification |
|---|---|
| CPU | 24 cores (specific model TBD) |
| GPU | NVIDIA GPU with CUDA support |
| Memory | Sufficient for model + system |
| OS | Linux 6.15.11-061511-generic |
| Kernel | BTF-enabled for CO-RE eBPF |
| CUDA | Driver API + Runtime API |
We used three interference tools:
| Tool | Purpose | Configuration |
|---|---|---|
| stress-ng | CPU load |
--cpu 0 --cpu-method fft (all cores) |
| iperf3 | Network I/O | Server + Client, 10 parallel streams, 60s |
| fio | Disk I/O | randwrite, bs=4k, iodepth=32, 4 jobs |
The full experiment has six scenarios:
| Scenario | Description | Interference |
|---|---|---|
| Baseline | Clean environment | None |
| Noisy CPU | CPU-intensive | stress-ng on all cores |
| Noisy Network | Network I/O | iperf3 localhost loopback |
| Noisy Disk | Disk I/O | fio random write |
| Heavy Load | Combined | CPU + Network + Disk simultaneously |
| Optimized | CPU pinning | stress-ng + taskset -c 0-3 + nice -n -10 |
Data collection follows the same pattern in every run:
# Start tracing
sudo ./cuda_sched_trace > trace.csv 2> trace.log &
TRACE_PID=$!
# Run benchmark
cd qwen3.cu
/usr/bin/time -v ./runcu Qwen3-0.6B-FP32.gguf -q "What is eBPF?" -r 1
# Stop tracing
sudo kill -SIGINT $TRACE_PID
# Analyze results
python3 analyze_scheduler_impact.py
Analysis Method
The central analysis compares consecutive CUDA kernel launches:
Launch_i -> [interval] -> Launch_i+1
Group A: Launches with NO context switch in interval (normal flow)
Group B: Launches with context switch in interval (preempted)
Preemption Penalty = median(Group B interval) - median(Group A interval)
To compare runs of different lengths, scheduler and IRQ counts are normalized per 1,000 kernel launches:
Sched/1K = (Total Context Switches / Total Kernel Launches) x 1000
IRQ/1K = (Total IRQs / Total Kernel Launches) x 1000
Performance impact is reported as:
Slowdown % = (Baseline tok/s - Scenario tok/s) / Baseline tok/s x 100
RQ1: Does CPU Scheduler Significantly Impact GPU Performance in Clean Environments?
The first question is whether scheduler preemption matters when the machine is otherwise clean.
Experiment design
- Condition: clean system, no artificial interference
- Metrics: context switch frequency, preemption penalty, total runtime impact
- Analysis: launch-pair comparison with and without context switches
Results
| Metric | Value |
|---|---|
| Total Runtime | 79.5 seconds |
| Kernel Launches | 51,464 |
| Context Switches | 592 (7.44 Hz) |
| OFF-CPU Time | 7.88 ms (0.01%) |
Launch-pair analysis shows that almost every consecutive launch pair is unaffected by context switches:
| Group | Count | Percentage | P50 Interval | P90 Interval | P99 Interval |
|---|---|---|---|---|---|
| No Context Switch | 51,401 | 99.9% | 2 us | 4 us | 4 us |
| With Context Switch | 62 | 0.1% | 15.3 ms | 15.5 ms | 5.0 s |
The median preemption penalty is 15.3 ms. That is large for the affected pairs, but only 62 pairs were affected.
Tail-latency attribution confirms that most outliers are not caused by scheduler preemption:
| Percentile | Total Outliers | With Context Switch | Attribution |
|---|---|---|---|
| P95+ | 2,580 | 62 (2.4%) | 97.6% application |
| P99+ | 515 | 62 (12.0%) | 88.0% application |
The total scheduler impact is:
Impact = Affected Pairs x Penalty = 62 x 15ms = 0.93 seconds
Percentage = 0.93 / 79.5 = 1.2%
Finding: in clean environments, CPU scheduler impact is minimal at 1.2%. The vast majority of kernel launch pairs, 99.9%, are unaffected by context switches. Tail latency mostly comes from application behavior such as token-generation boundaries, not scheduler preemption.
RQ2: What Is the Impact of IRQ Interrupts on GPU Performance?
The second question is whether IRQs directly interfere with the CPU-side launch path.
Experiment design
- Condition: clean system with IRQ tracing enabled
- Metrics: IRQ frequency, duration, type distribution
- Analysis: IRQ time as percentage of total runtime
Results
| Metric | Value |
|---|---|
| Total Runtime | 4.99 seconds |
| Kernel Launches | 125,236 |
| Soft IRQs | 653 events |
| Hard IRQs | 0 events |
Soft IRQ type distribution:
| Type | Count | Total Time | Avg Time | Max Time | Percentage |
|---|---|---|---|---|---|
| TIMER | 317 | 0.77 ms | 2.4 us | 30.1 us | 49% |
| RCU | 291 | 0.40 ms | 1.4 us | 17.2 us | 45% |
| NET_RX | 30 | 0.13 ms | 4.5 us | 14.0 us | 4.6% |
| SCHED | 15 | 0.07 ms | 4.9 us | 18.9 us | 2.3% |
Total IRQ impact:
Total IRQ Time: 1.38 ms
Percentage of Runtime: 0.0276%
There are real reasons to worry about IRQs: direct handler time, cache pollution, CPU pipeline disruption, and delay accumulation on critical paths. But for this local inference workload, actual IRQ impact is small.
The reason is the workload shape. Qwen3 submits about 950 launches in a burst lasting less than 100 us, so IRQs rarely land inside the burst. Most IRQs happen between bursts during CPU compute. TIMER interrupts dominate and have a small cache footprint. There is little network I/O, so NET_RX appears only 30 times, and there are no hard IRQs from NVMe or SSD block-device interrupts.
Finding: IRQ impact is negligible for local LLM inference at 0.0276%. This does not mean IRQs never matter. Distributed training with network communication or on-the-fly data loading can see much higher IRQ impact, estimated around 5-20%.
RQ3: How Do Noisy Neighbors Affect GPU Performance?
The third question is the most production-relevant one: what happens when the GPU workload shares a machine with other CPU, network, and disk activity?
Experiment design
| Scenario | Interference | Purpose |
|---|---|---|
| Baseline | None | Reference point |
| Noisy CPU | stress-ng (all cores) | CPU contention |
| Noisy Network | iperf3 (10 streams) | Network IRQ |
| Noisy Disk | fio (4 jobs, randwrite) | Block IRQ |
| Heavy Load | All three combined | Production simulation |
| Optimized | CPU stress + taskset + nice | Mitigation test |
Results
Normalized metrics per 1,000 kernel launches:
| Scenario | Launches | Sched/1K | Soft IRQ/1K | Hard IRQ/1K | IRQ Time (ms) |
|---|---|---|---|---|---|
| Baseline | 56,882 | 22.8 | 5.8 | 0.0 | 0.62 |
| Noisy CPU | 61,184 | 11,932.8 | 6.4 | 0.0 | 0.33 |
| Noisy Network | 154,394 | 6.0 | 2.7 | 0.0 | 0.92 |
| Noisy Disk | 126,670 | 29.3 | 3.9 | 0.1 | 1.03 |
| Heavy Load | 99,424 | 6,044.6 | 2.4 | 0.0 | 0.37 |
| Optimized | 108,984 | 445.2 | 2.8 | 0.0 | 0.71 |
Performance impact:
| Scenario | tok/s | Runtime (s) | Slowdown | Context Switch Increase |
|---|---|---|---|---|
| Baseline | 54.77 | 3.00 | - | 1x |
| Noisy CPU | 49.93 | 4.15 | 8.8% | 524x |
| Noisy Network | 53.23 | 7.22 | 2.8% | 0.26x |
| Noisy Disk | 54.95 | 5.60 | -0.3% | 1.3x |
| Heavy Load | 43.56 | 6.97 | 20.5% | 265x |
| Optimized | 53.75 | 5.10 | 1.9% | 19.5x |
Scenario Analysis
Noisy CPU (stress-ng) causes the most direct scheduling pressure. Context switches increase 524x, from 22.8 to 11,932.8 per 1,000 launches, and throughput drops by 8.8%. The mechanism is simple: the CFS scheduler time-slices between the GPU process and stress-ng workers.
Noisy Network (iperf3) behaves differently. Context switches actually decrease, because the network load changes CPU competition patterns, while soft IRQs rise slightly. Throughput drops only 2.8%. In this local setup, network I/O primarily shows up as IRQ overhead rather than scheduler pressure.
Noisy Disk (fio) introduces the first hard IRQs, corresponding to block-device interrupts, but context switches remain low and throughput is effectively unchanged at -0.3% slowdown. Disk I/O has little impact on this workload.
Heavy Load (CPU + Network + Disk) is the worst case. Throughput drops by 20.5%, and scheduler events rise to 6,044.6 per 1,000 launches, a 265x increase over baseline. Interestingly, that is only 50.7% of the context-switch rate in the Noisy CPU case. The interference sources compete with each other, but their combined effect is still worst overall.
Heavy-load soft IRQ breakdown:
| Type | Count | Total Time | Avg Time |
|---|---|---|---|
| RCU | 213 | 217.4 us | 1.0 us |
| TIMER | 17 | 122.9 us | 7.2 us |
| SCHED | 5 | 33.3 us | 6.7 us |
Finding: noisy neighbors significantly affect GPU performance. Combined CPU, network, and disk interference causes 20.5% degradation. The signatures differ by source: CPU contention increases context switches, network I/O affects IRQ overhead, disk I/O introduces block interrupts with little throughput impact here, and combined load is worst due to cumulative effects.
RQ4: Can CPU Pinning Effectively Mitigate Scheduler Impact?
The fourth question is whether a simple deployment-level mitigation helps before reaching for a custom scheduler.
Experiment design
- Baseline: Noisy CPU scenario with
stress-ngon all cores - Optimized: same
stress-ngload, but the GPU process runs with:-
taskset -c 0-3to pin it to cores 0-3 -
nice -n -10to give it higher priority
-
Results
| Metric | Noisy CPU | Optimized | Improvement |
|---|---|---|---|
| Sched/1K | 11,932.8 | 445.2 | 96.3% reduction |
| tok/s | 49.93 | 53.75 | 7.6% improvement |
| vs. Baseline | 8.8% slower | 1.9% slower | Significant recovery |
CPU pinning and priority adjustment recover most of the lost throughput. But the optimized case still has 445.2 scheduler events per 1,000 launches, compared with 22.8 in the clean baseline. That is still 19.5x higher than baseline.
Complete elimination is hard because:
-
stress-ngworkers may still be scheduled on cores 0-3. - System daemons and kernel threads cannot be fully excluded by
taskset. - IRQ affinity may still route interrupts to pinned cores.
For stronger isolation, the next steps are kernel-level isolation and IRQ placement:
# 1. Use isolcpus kernel parameter (boot time)
isolcpus=4-7 nohz_full=4-7
# 2. Bind GPU process to isolated cores
taskset -c 4-7 ./gpu_app
# 3. Bind IRQs away from GPU cores
echo 0-3 > /proc/irq/*/smp_affinity_list
# 4. Use cgroups for CPU isolation
cgcreate -g cpu:gpu_workload
cgset -r cpuset.cpus=4-7 gpu_workload
cgexec -g cpu:gpu_workload ./gpu_app
Finding: CPU pinning is highly effective. It reduces context switches by 96.3% and recovers 7.6% throughput. But full recovery under heavy load requires deeper isolation such as isolcpus, nohz_full, cpusets, and IRQ affinity management.
What the Results Mean
The results point to four practical insights.
First, environment matters. Scheduler impact ranges from 1.2% in a clean environment to 20.5% under combined heavy load. Optimizing the scheduler on a quiet dedicated server may not be worth the complexity. On a shared host, it can be the difference between stable and degraded inference.
Second, workload shape matters. Qwen3 has bursty kernel submission, roughly 950 launches in less than 100 us per token burst. That shape makes it resilient to many IRQs because interrupts usually occur between bursts. A different workload with continuous network communication, streaming input, or tighter CPU-GPU handoff might behave differently.
Third, interference sources have distinct signatures:
| Interference | Primary Impact | Secondary Impact |
|---|---|---|
| CPU | Context switches | None |
| Network | IRQ overhead | Slight scheduling |
| Disk | Hard IRQs | Minimal |
| Combined | All of above | Worst overall |
Fourth, simple mitigations work, but only up to a point:
- CPU pinning: very effective, 96% context-switch reduction
- Priority adjustment: helpful but limited
- Full isolation: requires kernel configuration and IRQ affinity management
Comparison with Meta's sched_ext Findings
Our results differ from Meta's AI training observations because the workload is different.
| Aspect | Meta (AI Training) | Our Study (LLM Inference) |
|---|---|---|
| Primary Issue | Network IRQ (NET_RX) | CPU scheduling |
| IRQ Impact | 5-20% | 0.03% (local inference) |
| Optimization | sched_ext layer | taskset + nice |
| Workload | Distributed training | Single-node inference |
The key difference is communication. Distributed training constantly exchanges data through all-reduce, making NET_RX a major bottleneck. Local inference has minimal network I/O, so the dominant issue under noise is CPU scheduling rather than network interrupts.
Limitations
There are several limits to this study:
- eBPF tracing itself adds 1-5% overhead.
- The tool only supports CUDA, not OpenCL or HIP.
- The trace does not include GPU-side execution timing, so it cannot directly measure actual kernel runtime.
- IRQ attribution is limited: the trace cannot always identify which process caused a given IRQ.
- The experiments use a single GPU and do not cover multi-GPU behavior.
Practical Recommendations
For production deployments:
| Environment | Recommendation | Expected Benefit |
|---|---|---|
| Dedicated Server | No optimization needed | - |
| Shared Server (light) | taskset + nice | 5-10% improvement |
| Shared Server (heavy) | isolcpus + IRQ affinity | 15-20% improvement |
| Kubernetes | CPU limits + nodeSelector | Varies |
The decision tree is simple:
Is GPU workload latency-sensitive?
├── No -> No optimization needed
└── Yes -> Is server shared?
├── No -> Monitor only, optimize if needed
└── Yes -> How heavy is colocated load?
├── Light -> taskset + nice
└── Heavy -> isolcpus + dedicated cores
Conclusion
CPU scheduling and IRQ handling do not always matter for GPU inference, but they matter under the conditions where production systems often run: shared hosts, background load, and noisy neighbors.
The clean baseline shows minimal overhead: 1.2% scheduler impact and 0.03% IRQ impact. But combined CPU, network, and disk interference causes 20.5% throughput degradation. CPU pinning cuts context switches by 96.3% and recovers most of the lost performance, but not all of it.
The practical lesson is to measure first. Use tracing to identify whether your workload is scheduler-bound, IRQ-sensitive, or mostly application-limited. Then choose the mitigation that matches the signature: CPU pinning for CPU contention, IRQ affinity for interrupt interference, I/O tuning for block-device pressure, and full CPU isolation when the workload is latency-sensitive and colocated load is heavy.
References
- Meta Platforms, Inc. "Accelerating AI Training with sched_ext." Linux Plumbers Conference 2025. https://lpc.events/event/19/contributions/2039/
- NVIDIA Corporation. "CUDA Driver API Reference." https://docs.nvidia.com/cuda/cuda-driver-api/
- Linux Kernel Documentation. "BPF Documentation." https://www.kernel.org/doc/html/latest/bpf/
- stress-ng. "A tool to load and stress a computer system." https://github.com/ColinIanKing/stress-ng
- iperf3. "A TCP, UDP, and SCTP network bandwidth measurement tool." https://github.com/esnet/iperf
- fio. "Flexible I/O Tester." https://github.com/axboe/fio
Top comments (1)
great insights on how CPU scheduling affects GPU performance. it's interesting to see the nuances of how interrupts can impact inference speed. at moonshift, we help you get a full next.js + postgres + auth app deployed in about 7 minutes, and you own the code. if you want to give it a try, I can set you up for a free run.