The afternoon sun was hitting the glass. Marcus, a senior engineer who had seen more kernel panics than he had hot meals, sat leaning back in his chair. Opposite him, Leo, the new intern, was staring intensely at a series of assembly snippets on his dual-monitor setup.
"Marcus," Leo said, spinning around. "I’m looking at how Linux handles processes. I get the theory, but the hardware part feels like magic. There’s no task_struct_ptr register on an x86 CPU. So how does the kernel actually know who it is the millisecond a system call happens?"
Marcus grinned. "You’ve hit the 'GPS' problem of OS design. Pull up a chair. You have to understand that the CPU is a bit of a blank slate; the OS has to carve its own landmarks into the hardware."
Why the GS Register is the Kernel’s North Star
"To answer your question," Marcus began, "we have to talk about the GS register. In the world of x86_64, OS designers decided that general-purpose registers like RAX or RBX were too precious to waste. If you used RBX to hold the pointer to the current task, you’d have one less register for math in every single function. Plus, if a programmer accidentally overwrote it, the system would forget which task was running and crash immediately."
He pointed to Leo’s screen. "Instead, Linux uses the GS register as a stable anchor. It’s reserved specifically to point to Per-CPU Data. Each CPU core has its own GS base address that points to a private memory area for that specific core."
The Transition: From User to Kernel
"When you’re in User Mode," Marcus continued, "the GS register might be used for thread-local storage. But the moment a system call or an interrupt happens, the CPU executes a specialized instruction called swapgs. This instruction flips the GS base to a kernel-specific area. Inside that area, the kernel stores the most important information: the Task Struct and the Kernel Stack."
"So it's like a 'GPS Home' button?" Leo asked.
"Exactly. When the kernel code needs to know 'Who am I?', it doesn't search a list. It just executes something like mov rax, gs:[offset_of_current_task]. It’s a direct, protected lookup."
Entry and the Trap Frame
"But before the kernel even thinks about switching tasks," Marcus said, "it has to protect the current one. The moment a user program triggers a syscall or an interrupt, the CPU executes swapgs. This flips the GS base to the kernel’s side. But then, the kernel performs its first vital ritual: Saving the User State."
"Wait, so it doesn't just jump into kernel code?" Leo asked.
"Not without a backup. The kernel immediately pushes all the user's current registers—RAX, R11, RCX, and the rest—onto the kernel stack. This collection is called the Trap Frame (or pt_regs). This ensures that no matter what the kernel does next, it has a perfect 'save-point' of exactly what the user was doing. Without this step, we could never return to the application."
The RSP Dilemma
"Wait,What happened to the user stack pointer ?Do we just save it too while switching?" leo asked.Marcus leaned back, laughing softly. "That’s the exact question that keeps kernel developers up at night. You're thinking about the 'Exit'the popping. But the problem is the 'Entrance'. It’s a game of musical chairs where the CPU only has one seat, and that seat is the RSP register."
He grabbed a dry erase marker and drew a single box on the board labeled RSP.
"Here is your problem, Leo, The RSP register can only hold one address at a time. It’s either pointing at the User Stack or the Kernel Stack. It cannot point to both."
The "One-Handed" Problem
"Imagine you are holding a priceless Ming vase.that's the User Stack Pointer," Marcus said, gesturing with his hands. "You need to put it into a safety deposit box.that's the Kernel Stack. But the only way to open the safety deposit box is to use the same hand that is currently holding the vase."
"The moment you reach for the box (the Kernel Stack), you have to let go of the vase (the User Stack Pointer). If you let go, it smashes on the floor. You've lost it."
The "Stash" Before the "Push"
"To get that User RSP value into the pt_regs struct on the kernel stack, you have to perform a very specific sequence of moves. Watch the registers closely:"
1. The Entry: You just hit syscall. RSP is currently holding the User Stack address.
2. The Dilemma: You need to change RSP to point to the Kernel Stack so you can start pushing pt_regs.
3. The Move: But if you just do mov rsp, gs:[top_of_stack], the User Stack address is gone, overwritten in the register. You can't 'push' it later because you no longer know what it was.
"This is why we use gs:[user_stack_ptr] as a temporary shelf," Marcus explained.
"The kernel executes:
mov gs:[user_stack_ptr], rsp (Putting the vase on the shelf)
mov rsp, gs:top_of_stack
Now that RSP is safely pointing to the Kernel Stack, the kernel can finally do the 'Push' you’re talking about:
push gs:user_stack_ptr"
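That three-step shuffle can be simulated in plain C. Everything here is a stand-in: the struct fields play the GS-based slots, the `regs_rsp` variable plays the real register, and the names are illustrative.

```c
/* Userspace simulation of the entry shuffle. `regs_rsp` plays the RSP
 * register; `pc` plays the GS-addressed per-CPU area. */
struct percpu_slots {
    unsigned long user_stack_ptr;   /* the temporary shelf */
    unsigned long top_of_stack;     /* this core's kernel stack */
};

static unsigned long entry_shuffle(unsigned long *regs_rsp,
                                   struct percpu_slots *pc,
                                   unsigned long *kstack_slot)
{
    pc->user_stack_ptr = *regs_rsp;     /* mov gs:[user_stack_ptr], rsp */
    *regs_rsp = pc->top_of_stack;       /* mov rsp, gs:[top_of_stack]   */
    *kstack_slot = pc->user_stack_ptr;  /* push gs:[user_stack_ptr]     */
    return *kstack_slot;                /* the user RSP survived intact */
}
```

Run the three lines in any other order and the user stack pointer is lost the moment the register is overwritten; the shelf store must come first.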
"Can't we just pop?"
"You're right about the end of the story," Marcus nodded. "When the system call is over, we do pop the registers from the pt_regs back into the CPU. But you can't pop something that was never pushed. And you can't push the User RSP unless you have a temporary place to hold it while you're switching the RSP register itself."
Leo stared at the whiteboard. "So it's literally just a bridge? A place to hold the value for, like, three instructions?"
"Exactly," Marcus said. "It’s a 'scratchpad.' Once the User RSP is safely pushed into the pt_regs structure on the kernel stack, the value sitting in pcpu_hot.user_stack_ptr isn't actually used for much else until the next time that CPU core handles a transition. It's the only way to 'teleport' the stack pointer from a register to a memory location without losing it in the process."
"Why not use a general register like RAX?"You might ask: 'Why not just move RSP into RAX temporarily?'" Marcus anticipated the next question. "Because RAX is a User Register. It contains the System Call Number (like 1 for write or 0 for read). If you move the stack pointer into RAX, you just deleted the user's instructions on what they wanted the kernel to do!"
"The GS-based memory is the only place in the entire universe the kernel can 100% trust to be 'clean' and available the moment a syscall starts."
Marcus tossed the marker back onto the tray. "It’s all about that tiny window of time where the CPU is 'homeless' between stacks. user_stack_ptr is the kernel’s temporary roof."
The Great Teleportation: Updating the Identity
Leo tapped his pen. "Okay, so we’ve saved the user registers. Now the scheduler wants to move from Task A to Task B. How does the GS area know Task B is the new king?"
"This is the crucial part people often miss," Marcus said. "The scheduler doesn't just swap the stack; it has to re-program the identity of that CPU core. Inside the context switch function, the kernel physically overwrites the memory inside the Per-CPU area.
It takes Task B’s task_struct address and writes it into gs:[offset_of_current_task]. Then, it takes Task B’s kernel stack pointer and writes it into gs:[offset_to_top_of_stack].
Crucially, we don't change the GS Base address itself; that still points to the same CPU core. We just update the 'data' inside that core's private office. Once those pointers are updated, any part of the kernel that asks 'Who is the current task?' will now get Task B as the answer."
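As a sketch, that re-programming step is just two stores into the private area. The names below echo the idea, not the kernel's exact symbols.

```c
/* Illustrative only. The GS base (here, the address of `hot`) is never
 * touched; the switch rewrites the data *inside* the per-CPU area. */
struct task_info {
    int pid;
    unsigned long kernel_stack_top;
};

struct percpu_identity {
    struct task_info *current_task;  /* gs:[offset_of_current_task] */
    unsigned long top_of_stack;      /* gs:[offset_to_top_of_stack] */
};

static void reprogram_identity(struct percpu_identity *hot,
                               struct task_info *next)
{
    hot->current_task = next;
    hot->top_of_stack = next->kernel_stack_top;
}
```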
The Performance Shortcut: Beating the Memory Latency
Leo frowned. "Wait. If the kernel has to find the gs base, then find the task_struct, then find the pointer to the stack inside that struct... isn't that a lot of memory hopping? That sounds slow for something that happens thousands of times a second."
"You’re right," Marcus nodded. "That's a classic bottleneck. Modern Linux engineers hate extra memory hops. To fix this, they use redundancy for speed. They don't go to the task_struct every time they need the stack."
"Inside that Per-CPU area pointed to by GS, there's a structure—often called pcpu_hot. The kernel stores a direct copy of the 'top of the stack' right there, just bytes away from the task pointer.
When a syscall triggers:
- The CPU is still on the User Stack (RSP).
- The kernel runs swapgs.
- It immediately executes mov rsp, gs:[0x00] (or whatever the stack shortcut offset is).
In one instruction, the kernel is on the correct stack. It never even had to touch the task_struct to find it. Because this data is accessed constantly, it almost always lives in the L1 Cache. It’s nearly as fast as a physical register."
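The payoff is visible if you count the dependent loads. A sketch, with illustrative names:

```c
/* Two ways to find the kernel stack. The slow path chases a pointer
 * (two dependent loads); the fast path reads the redundant copy the
 * scheduler keeps next to the task pointer (one load, usually L1-hot). */
struct task_slow { unsigned long kernel_stack_top; };

struct percpu_cached {
    struct task_slow *current_task;
    unsigned long top_of_stack;   /* redundant copy, kept in sync */
};

static unsigned long stack_via_task(const struct percpu_cached *hot)
{
    return hot->current_task->kernel_stack_top;  /* load, then load */
}

static unsigned long stack_via_cache(const struct percpu_cached *hot)
{
    return hot->top_of_stack;                    /* one load */
}
```

Both return the same address; the redundancy buys nothing but latency, which is exactly the point on a path taken thousands of times a second.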
The Handover: How Context Switching Actually Works
"Okay," Leo said, leaning in. "So we have the pointers. But how does the scheduler actually stop Task A and start Task B? How does Task B know where it left off?"
Marcus cleared his throat, warming up to the narrative. "This is the 'Save Game' trick. Think of a context switch as a 'handover.' When the scheduler decides to switch, it calls a function named __switch_to(). This is where the magic happens."
1. Freezing Task A
Before Task A is moved aside, the kernel must 'freeze' its state. It pushes the current registers—RBX, RBP, and others onto Task A's own kernel stack. The very last thing it does is save the current RSP (the stack pointer) into Task A's task_struct.
2. The RSP Swap
"Now comes the moment of teleportation," Marcus said, mimicking a switch flip with his hand. "The scheduler loads the address of Task B. It takes the stack pointer that Task B saved the last time it was frozen and moves it into the actual RSP register."
# rax = Task B's task_struct
mov rsp, [rax + stack_offset] # THE SWITCH: rsp now points to Task B's stack!
"The second that instruction executes, the CPU is no longer looking at Task A’s history. It is looking at Task B’s stack."
The Layered Timeline: Why the Stack is a Time Machine
Leo looked puzzled. "But how does Task B know what to do next? Does the scheduler have to jump to a specific instruction?"
"That’s the beauty of it," Marcus replied. "The scheduler doesn't 'jump.' It returns. You see, every task follows the same ritual to go to sleep, so they all know how to wake up. A task’s kernel stack is a perfectly preserved timeline with two layers."
The Two Layers:
- Layer 1 (The Trap Frame): This is the User State, at the very bottom of the stack. It contains the User RIP (where the app was in its own code) and the User RSP.
- Layer 2 (The Context): This is the Kernel State. It contains the registers the task was using inside the kernel just before it was switched out.
"Because the stack is First-In, Last-Out (FILO), the task just travels backward through its own history.
- It pops the kernel registers (restoring its 'Kernel Self').
- It executes the ret instruction.
Now, here is the secret: the ret instruction pops the top value of the stack into RIP (the Instruction Pointer). Since Task B was frozen during a function call, the top of its stack contains the 'return address': the exact line of code following the switch. It literally 'returns' into its own past."
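You can model that in miniature with a toy stack: here 'ret' is just a pop into a variable playing RIP, and the array grows upward only because it is a simulation.

```c
/* Toy stack of unsigned longs. A `call` pushes the return address;
 * a `ret` pops the top of the stack into RIP. */
struct toy_stack { unsigned long slot[16]; int top; };

static void toy_push(struct toy_stack *s, unsigned long v)
{
    s->slot[s->top++] = v;
}

static unsigned long toy_pop(struct toy_stack *s)
{
    return s->slot[--s->top];
}

/* `ret` in miniature: the next RIP is whatever the call left on top,
 * so a woken task resumes right after the line that switched it out. */
static unsigned long toy_ret(struct toy_stack *s)
{
    return toy_pop(s);
}
```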
The Final Piece: The Kernel RIP
"Wait," Leo interrupted, "so the Kernel RIP is never explicitly saved with a mov command?"
"Exactly!" Marcus clapped his hands. "When the code calls __switch_to, the call instruction automatically pushes the next RIP onto the stack as a bookmark. Then, when the task wakes up later and hits ret, it simply picks up that bookmark and continues.
Once the kernel finishes its work, it executes a final sysret or iretq. This pops Layer 1, the Trap Frame, restoring the original User RIP and RSP. The CPU is suddenly back in the user application, exactly where it left off, unaware that it was ever frozen."
Marcus stood up and headed toward the coffee machine. "The stack isn't just memory, Leo. It’s a timeline. As long as you keep the RSP and the GS base pointed at the right spots, the CPU will always know exactly who it is and what it was doing."
Leo turned back to his monitors, the assembly code finally looking less like a wall of text and more like a carefully choreographed dance.
The Blueprint: struct pcpu_hot
In modern kernels, it looks like this. Notice how current_task and top_of_stack are right at the top for maximum speed.
struct pcpu_hot {
union {
struct {
/* The current running task_struct */
struct task_struct *current_task;
/* The top of the kernel stack for the current task */
unsigned long top_of_stack;
/* Used for user/kernel transition (GS base swap) */
unsigned long user_stack_ptr;
/* ... other high-frequency data ... */
};
u8 pad[64]; /* Padded to a full cache line */
};
};
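You can check the padding trick from userspace. The struct below keeps only the three named fields from the listing (an abbreviated copy, not the kernel's definition), and a static assert confirms they all share one 64-byte cache line.

```c
/* Abbreviated userspace copy of the layout above. The anonymous
 * union/struct pair makes the hot fields and the pad overlap, so the
 * whole structure is exactly one cache line. */
struct pcpu_hot_demo {
    union {
        struct {
            void *current_task;
            unsigned long top_of_stack;
            unsigned long user_stack_ptr;
        };
        unsigned char pad[64];
    };
};

_Static_assert(sizeof(struct pcpu_hot_demo) == 64,
               "hot data must fit one cache line");
```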