Shyam Kumar

RTOS Scheduling — What Nobody Told You

Powered by IIES Institute Bangalore

Motor controller. Production hardware. The system would run perfectly for hours, then suddenly freeze — no crash, no fault handler, just... silence. The watchdog would reset everything and we'd be back in business. For about six hours.

It took me two days to figure out it was a priority inversion. A low-priority logging task was holding a mutex that a high-priority motor task needed. A medium-priority CAN handler kept preempting the logger. The motor task starved. The watchdog fired.

I'd been writing RTOS code for a year at that point. I thought I understood scheduling. I did not. This post is everything I learned the hard way — written for you, so you don't have to spend two days staring at a logic analyzer.

<10µs: context switch on a Cortex-M4 @ 168MHz
O(1): task selection — a single CLZ instruction
69.3%: max CPU utilization under RMA (as n→∞)


First, Let's Kill a Myth


The most common thing I see new embedded engineers get wrong: they think an RTOS makes their system faster. It doesn't. It makes timing predictable. That's a completely different thing, and it's the whole point.

On a bare-metal superloop, everything runs in sequence. Your 100ms display refresh sits right next to your 1ms motor update. You're always one long function away from missing a deadline. You can go fast — but you can't go on time.

An RTOS gives you a contract: 'A task with sufficient priority will preempt whatever is running within a bounded window.' That bound is what makes real-time systems work. Not speed. Predictability.

Think of the RTOS scheduler as a function that runs at every tick and after every blocking call, answering one question: 'Which task should own the CPU right now?' In a fixed-priority system, the answer is always: 'the highest-priority task that is Ready.' Everything else is just implementation detail.

The moment you need to service a CAN interrupt, update motor PWM, and refresh a display — all with distinct timing requirements — you need a scheduler. That's it. If your timing requirements are all the same, a superloop is fine. The second they diverge, you need this.


What the Scheduler Actually Does


Most tutorials show you a pretty diagram of task states and then show you xTaskCreate(). That's skipping the good part. Let me show you the actual scheduler code — FreeRTOS, stripped to its bones.

tasks.c  vTaskSwitchContext (simplified)
C
/* This is the entire scheduler. Seriously. */
void vTaskSwitchContext( void )
{
    UBaseType_t uxTopPriority;

    /* Find the highest-priority bit set in the ready-list bitmap */
    portGET_HIGHEST_PRIORITY( uxTopPriority, uxTopReadyPriority );

    /* Pick the first task at that priority level */
    listGET_OWNER_OF_NEXT_ENTRY( pxCurrentTCB,
        &( pxReadyTasksLists[ uxTopPriority ] ) );
}

/* On ARM Cortex-M, portGET_HIGHEST_PRIORITY expands to:
   uxTopPriority = ( 31UL - __clz( uxReadyPriorities ) );

   That's it. One CLZ instruction. O(1). No matter how many tasks.
   This is why RTOS context switches are so fast. */

Task States — Draw This on Paper


Seriously, take five minutes and draw the task state machine on paper. More RTOS bugs come from not having this mental model clearly loaded than from any API misuse. Here it is:

Task State Machine — commit this to memory

The thing people get wrong: they think 'BLOCKED' means the task is stuck. It's not stuck — it's efficiently parked. A blocked task consumes zero CPU. It's sitting in a list, waiting for something specific. The scheduler doesn't even look at it until that event fires.

This is why RTOS-based systems can run dozens of tasks on a Cortex-M4 with sub-millisecond response times and still have 90% CPU headroom. Most tasks are blocked most of the time. The scheduler only runs tasks that have something to do.

The Tick Rate Trap (I've Seen This a Dozen Times)
vTaskDelay(1) does not delay for 1 millisecond. It delays for 1 tick. If your tick rate is 100Hz, that's 10ms. Always use pdMS_TO_TICKS(1) and always know what your tick rate is configured to. I once inherited a codebase where someone had set configTICK_RATE_HZ to 10 — every vTaskDelay(100) was actually a 10-second delay. Fun to debug.


Context Switching in Assembly


This is the part most tutorials skip. A context switch isn't magic — it's assembly code that saves every CPU register from the current task onto its stack, calls the scheduler to pick the next task, then restores every register from the new task's stack. That's it.

On ARM Cortex-M, the hardware helps you out. When an interrupt fires (including the PendSV interrupt the RTOS uses for scheduling), the CPU automatically pushes 8 registers onto the current stack before jumping to your ISR. The RTOS just has to handle the rest.

port.c  PendSV_Handler, ARM Cortex-M4F (simplified and annotated)
ASM
PendSV_Handler:
    ; Hardware already saved: xPSR, PC, LR, R12, R3, R2, R1, R0
    ; Those 8 are free. Now we save the rest.

    MRS     R0, PSP              ; Get current task's Process Stack Pointer
    LDR     R3, =pxCurrentTCB   ; Load address of current TCB pointer
    LDR     R2, [R3]            ; R2 = current TCB

    VSTMDB  R0!, {S16-S31}      ; Save FPU regs (the real port checks
                                ; EXC_RETURN bit 4 and skips this if the
                                ; task never touched the FPU)
    STMDB   R0!, {R4-R11, R14}  ; Save R4-R11 + EXC_RETURN value
    STR     R0, [R2]            ; Save updated stack pointer back into TCB

    ; ↑ Current task is now fully frozen. Stack holds its entire world.

    BL      vTaskSwitchContext   ; Pick the next task (updates pxCurrentTCB)

    ; ↓ Restore the new task. It was frozen exactly like this at some point.

    LDR     R1, [R3]            ; R1 = new TCB (pxCurrentTCB was updated)
    LDR     R0, [R1]            ; R0 = new task's saved stack pointer
    LDMIA   R0!, {R4-R11, R14}  ; Restore R4-R11 + EXC_RETURN
    VLDMIA  R0!, {S16-S31}      ; Restore FPU registers
    MSR     PSP, R0             ; Update Process Stack Pointer

    BX      R14                 ; Return from exception.
                                ; Hardware restores the other 8 registers.
                                ; CPU is now running the new task as if
                                ; it was never interrupted. Magic. (It's not magic.)

The Task Control Block (TCB) is a struct whose first member is always the saved stack pointer. That's not an accident — it means the assembly above can find it at offset zero without needing to know anything else about the struct layout. Clean engineering.

Stack Sizing Is Not a Guess
Each task needs enough stack for its own local variables, its call chain, AND the full context save shown above (17 core registers + 16 FPU registers = 33 words = 132 bytes just for the context frame, before your code does anything). Undersize the stack and you get silent memory corruption — the overflow writes into whatever is adjacent in RAM. Set configCHECK_FOR_STACK_OVERFLOW to 2 in every development build. Every one.

Priority Inversion — Back to My Bug
Remember the motor controller bug from the intro? Let me walk you through exactly what happened — because understanding this scenario saves careers.

In 1997, the Mars Pathfinder landed on Mars. Within a few days, the system started resetting itself. Telemetry showed a watchdog timeout. NASA engineers pored over the data, running the same software on Earth, trying to reproduce it. Eventually they found it: a priority inversion between a low-priority meteorological task holding an information bus mutex, a medium-priority communications task preempting it repeatedly, and a high-priority bus manager task starving as a result. The fix — enabling priority inheritance in VxWorks, a single config flag — was uploaded to a spacecraft 190 million kilometres away. It worked.

Here's the exact scenario. Three tasks, one mutex:

Priority Inversion — what actually happens on the timeline

The scheduler isn't broken. It's doing exactly what you told it to do. HIGH is blocked (legitimately waiting on a mutex). MEDIUM is ready. So MEDIUM runs. The scheduler cannot know that MEDIUM's execution is indirectly preventing HIGH from getting what it needs.

The fix is priority inheritance: when HIGH blocks on a mutex held by LOW, temporarily raise LOW's priority to match HIGH's. Now MEDIUM can't preempt LOW. LOW finishes, releases the mutex, its priority drops back to normal, and HIGH gets what it was waiting for.

The fix  one word difference, massive impact
C
/* ❌ WRONG — binary semaphore has NO priority inheritance */
xMutex = xSemaphoreCreateBinary();

/* ✅ CORRECT — mutex implements priority inheritance automatically */
xMutex = xSemaphoreCreateMutex();

/* That's it. That's the entire fix.
   When a high-priority task blocks on this mutex, FreeRTOS
   automatically boosts the holding task's priority.
   No code changes anywhere else required. */

/* The general rule: if you're protecting a shared resource
   (SPI bus, I2C peripheral, buffer, state) — use a MUTEX.
   Binary semaphores are for signalling, not resource protection. */

Scheduling Algorithms — Your Actual Options


Most embedded RTOS implementations give you Fixed-Priority Preemptive Scheduling. Tasks have static priorities. Highest-priority ready task runs. Higher-priority tasks preempt lower ones immediately. Clean. Simple. Auditable. Use it.

There's a theorem behind priority assignment called Rate Monotonic Analysis (RMA): assign higher priority to tasks with shorter periods. A task running every 1ms gets a higher priority than one running every 10ms. This is provably optimal — if any fixed-priority assignment can meet all deadlines, the rate-monotonic assignment will too.

📐 The Utilization Bound Formula (Worth Memorising)
For n tasks, the system is provably schedulable if:

U = Σ (C_i / T_i) ≤ n(2^(1/n) − 1)

Where C_i = worst-case execution time, T_i = period. As n grows, this approaches ln(2) ≈ 69.3%. If your total utilization is under ~70%, you're almost certainly fine. Over 70%, you need to run Response Time Analysis to be sure.
| Property | Cooperative | Preemptive |
| --- | --- | --- |
| Context switch happens when | The running task explicitly yields | Any tick or ISR return |
| Task response latency | Unbounded | Bounded (≤1 tick) |
| Race conditions | Fewer — natural protection | Must use mutexes |
| Hard real-time | No | Yes |
| Debug difficulty | Easier | Timing-dependent bugs |
| When to use | Tiny MCUs, all tasks equal priority | Anything with mixed timing requirements |

In practice, use preemptive. Cooperative scheduling is a useful teaching tool and occasionally appropriate for deeply resource-constrained parts, but if you're on a Cortex-M with an RTOS, you want preemption. You're not writing an Arduino sketch.


Patterns That Actually Work in Production


1. Keep ISRs Stupid Short

An ISR that does actual work is a bug waiting to happen. You're running at interrupt priority — you can't use most FreeRTOS APIs, you can block the scheduler, and you're eating into every other interrupt's latency. Post to a queue and return. Let a task do the work.

The deferred interrupt pattern  the right way to handle hardware events
C
/* ISR: as short as humanly possible */
void USART1_IRQHandler( void )
{
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    uint8_t   byte = USART_ReceiveData( USART1 );

    xQueueSendFromISR( xRXQueue, &byte, &xHigherPriorityTaskWoken );

    /* If posting unblocked a higher-priority task, trigger a context
       switch before this ISR returns. No delay. Immediate handoff. */
    portYIELD_FROM_ISR( xHigherPriorityTaskWoken );
}

/* Task: does the actual work at task priority */
void vUARTTask( void *pvParams )
{
    uint8_t byte;
    for(;;) {
        /* Zero CPU usage while nothing arrives */
        if( xQueueReceive( xRXQueue, &byte, pdMS_TO_TICKS(100) ) )
            vProcessByte( byte );
    }
}

2. Use vTaskDelayUntil — Not vTaskDelay

This one bites people constantly. vTaskDelay() starts counting from when the task wakes up. So if your task body takes 2ms and you delay for 10ms, your actual period is 12ms. And it drifts. Use vTaskDelayUntil() for anything periodic — it measures from the last wake time, so execution time doesn't accumulate into your period.

periodic_task.c  do it this way, every time
C
void vMotorControlTask( void *pvParams )
{
    TickType_t xLastWake = xTaskGetTickCount();
    const TickType_t xPeriod = pdMS_TO_TICKS( 1 );  /* 1ms hard */

    for(;;) {
        vTaskDelayUntil( &xLastWake, xPeriod );
        /* Blocks until (xLastWake + xPeriod), then updates xLastWake.
           Even if you ran long last iteration, the next wake time
           is still correct. No drift. */

        vReadEncoders();
        vRunPIDLoop();
        vSetPWMOutputs();
    }
}


3. Enable Stack Overflow Detection — Always

FreeRTOSConfig.h + hooks.c
C
/* FreeRTOSConfig.h — method 2 fills stack with 0xA5 pattern
   and checks it on every context switch. Catches it early. */
#define configCHECK_FOR_STACK_OVERFLOW  2

/* hooks.c */
void vApplicationStackOverflowHook( TaskHandle_t xTask, char *pcName )
{
    /* Stack is corrupt — do NOT call any FreeRTOS API here */
    taskDISABLE_INTERRUPTS();
    /* Hang. Let the debugger catch it or watchdog reset.
       pcName will tell you which task overflowed.
       Then go back and double its stack size. */
    for(;;);
}

/* After a while, use this to check headroom in normal operation: */
UBaseType_t remaining = uxTaskGetStackHighWaterMark( xMyTask );
/* If this is under ~50 words, size up. */

Mistakes I've Seen (And Made)


No judgment here. These are real mistakes from real systems, some of them mine.

| The Mistake | What You'll See | The Fix |
| --- | --- | --- |
| Calling a blocking API from an ISR | Hard fault, immediate crash, watchdog reset | Use xQueueSendFromISR() and friends |
| Binary semaphore as a mutex | Intermittent timing violations, priority inversion | xSemaphoreCreateMutex() — always |
| vTaskDelay() for periodic tasks | Gradual period drift, cumulative jitter | vTaskDelayUntil() — no exceptions |
| Stack too small | Corrupted globals, random crashes, hours of debugging | Enable overflow checking, use the watermark API |
| Acquiring mutexes in different orders | Deadlock. Full stop. System hangs forever. | One global acquisition order. Document it. Enforce it. |
| A task that never blocks | Everything at lower priority starves; system appears frozen | Every task must call a blocking API somewhere in its loop |
| Blocking FreeRTOS calls before vTaskStartScheduler() | Silent corruption, crash on the first switch | Initialise hardware in main(), start everything else in tasks |

Conclusion

Embedded systems engineering in India has matured fast. The engineers writing firmware for automotive ECUs, medical devices, and industrial controllers across Bangalore, Pune, and Hyderabad are dealing with exactly these problems — priority inversion on a CAN bus, stack overflows at 3am, a watchdog nobody can explain.
Institutes like the Indian Institute of Embedded Systems (IIES) exist because this gap between knowing C and understanding what the kernel is actually doing underneath is real, and it costs production hours.
But no course closes that gap alone. The concepts in this post — task states, context switching, priority inheritance — only become instinct after you've broken something in production and had to find it.
Read the theory. Then go write a task, starve it on purpose, and debug it yourself. That's the part nobody can teach you.
