우병수

Posted on May 21 • Originally published at techdigestor.com

Stackless Coroutines in ~200 Lines of C++: How I Manage Game State Without the Headaches

#productivity #tools #webdev #discuss



What's in this article

The Problem: Game Logic That Fights Your Code
What 'Stackless' Actually Means Here (And Why You Should Care)
The Core Mechanism: A State Struct + A Resume Function
Building the Scheduler: ~60 Lines That Run Everything
Writing Your First Real Coroutine: A Door-Opening Sequence
Spawning Child Coroutines and Waiting on Them
The Full ~200-Line Implementation (Annotated)

The Problem: Game Logic That Fights Your Code

The cutscene example is one I keep coming back to because it's so brutally mundane — yet it exposes the exact fracture between how programmers think about game logic and how we're forced to implement it. You want to express: walk to door, wait 2 seconds, open door, play dialogue. That's one coherent thought. What you end up writing instead is a state machine like this:

enum class CutsceneState {
    WALKING_TO_DOOR,
    WAITING_AT_DOOR,
    OPENING_DOOR,
    DOOR_OPEN_ANIM,
    DIALOGUE_STARTING,
    DIALOGUE_PLAYING
};

CutsceneState state = CutsceneState::WALKING_TO_DOOR;
float waitTimer = 0.0f;  // only used in one state, exists forever

void CutsceneSystem::Update(float dt) {
    switch (state) {
        case CutsceneState::WALKING_TO_DOOR:
            if (character.ReachedTarget()) {
                state = CutsceneState::WAITING_AT_DOOR;
                waitTimer = 0.0f;
            }
            break;
        case CutsceneState::WAITING_AT_DOOR:
            waitTimer += dt;
            if (waitTimer >= 2.0f)
                state = CutsceneState::OPENING_DOOR;
            break;
        // ... four more cases
    }
}

The waitTimer variable is the tell. It only matters for 2 seconds of the entire cutscene, but it lives in your struct or on your class forever, participating in zero other states, silently leaking intent. Multiply this across 40 cutscenes, 20 AI behaviors, and 15 tutorial sequences and you've got a codebase where the original programmer's intent is completely buried. A new team member has to reconstruct the what by reading the how. That's a maintenance tax you pay indefinitely.

The real pathology isn't the state count — it's that a single logical sequence is physically fragmented across unrelated callbacks, flags, and timer variables. The "walk to door" step and the "open door" step are causally linked in your head, but there's no link in the code. You can't read one function and understand the flow. You read six cases and mentally simulate the state machine. I've seen experienced devs spend 20 minutes debugging a cutscene that "stops halfway" only to find a flag that was never reset from a previous run. The bug isn't complex; the scaffolding just ate it.

C++20 coroutines (co_await, co_yield, promise types, the whole machinery) sound like the obvious fix. And they can work. But I've been on teams shipping to Nintendo Switch and PlayStation, and implementations built on C++20's <coroutine> support come with a real cost. The compiler support was spotty until relatively recently on console toolchains. More importantly, the abstraction is leaky: you end up writing or importing a coroutine frame allocator, a task scheduler, an awaitable type hierarchy. The cognitive overhead of explaining to a mid-level dev why their coroutine needs a promise type with specific return behavior before they can just write co_await Wait(2.0f) is genuinely high. Stackful coroutines (via Boost.Context or similar) solve the sequencing problem, but each one allocates its own stack, typically tens of kilobytes or more, which is a non-starter when you have hundreds of concurrent AI behaviors.

What we actually want is the simplest possible thing that lets us write sequential-looking logic that suspends between frames. No heap allocation per coroutine. No external library. No promise types. The target is code that reads like this:

Coroutine RunCutscene(Context& ctx) {
    co_walk_to(ctx, doorPosition);
    co_wait(ctx, 2.0f);
    co_open_door(ctx, door);
    co_play_dialogue(ctx, "Hello there.");
}

...where each of those lines can suspend execution and resume next frame (or several frames later), all the local state lives on whatever stack-equivalent we build ourselves, and the whole mechanism fits in a single header under 200 lines. The implementation trick is using Duff's Device — a legal C switch/case fallthrough pattern — to build a resumable function without any compiler coroutine support at all. Ugly at the macro level, completely transparent at the usage level.

What 'Stackless' Actually Means Here (And Why You Should Care)

The thing that trips most people up: "stackless" doesn't mean there's no stack at all — your coroutine still runs on the regular call stack while it's active. What it means is that when you suspend it, you don't save the entire call stack frame. You only save what you explicitly put in the state struct. That's the whole trick.

Stackful coroutines — Boost.Context, Windows fibers, ucontext on Linux — work by allocating a separate stack per coroutine, typically 64KB to 1MB each. Then on suspend, the runtime saves registers and swaps stack pointers. It's genuinely powerful: you can suspend from anywhere, including deep inside helper function calls. But the cost is real. Allocating 128 enemy AI coroutines at 64KB each is 8MB of stack memory before you've written a single behavior. Context-switching has measurable overhead. And platform-specific assembly glue means every new target (PS5, Switch, Android) is a porting headache.

With stackless coroutines, your coroutine's "paused" state is just a plain struct sitting in a pool you control. My typical coroutine state for a patrol enemy looks like this:

struct PatrolCoroutine {
    // Explicit state: only what survives a suspend
    int phase;         // which resume label we jump to
    float timer;       // persists across yields
    Vec3 targetPos;    // set before first yield, read after
    int waypointIdx;
};
// sizeof(PatrolCoroutine) == 24 bytes (assuming 4-byte int and a Vec3 of three 4-byte floats)
// sizeof(a fiber doing the same thing) == 65536+ bytes

The hard constraint: you cannot suspend from inside a nested function call. If your coroutine calls findPath() and you want to yield while it runs async, findPath itself has to be a coroutine — or you restructure to kick off the async op, yield, and check results on the next resume. That sounds painful but in practice 90% of game scripting is a flat sequence: wait for animation, do thing, wait for condition, do next thing. The nesting problem almost never comes up in that pattern.
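The "kick off the async op, yield, and check results on the next resume" restructuring can be sketched roughly like this. PathRequest, start_path_request, and path_ready are hypothetical stand-ins for whatever async pathfinding API your engine exposes; here the request just pretends to finish after a fixed number of polls.

```cpp
// Hypothetical async pathfinding handle; a real one would wrap a job system.
struct PathRequest {
    int polls_left = 0;
};

inline void start_path_request(PathRequest& r, int frames_needed) {
    r.polls_left = frames_needed;
}

// Returns true once the (simulated) background work has finished.
inline bool path_ready(PathRequest& r) {
    if (r.polls_left > 0) --r.polls_left;
    return r.polls_left == 0;
}

struct FindPathCo {
    int phase = 0;          // 0 = not started, 1 = waiting on request, 2 = done
    PathRequest request;    // lives in the struct, so it survives every yield
    int frames_waited = 0;
};

// Returns true while the coroutine still needs more frames.
inline bool find_path_resume(FindPathCo& co) {
    switch (co.phase) {
        case 0:
            start_path_request(co.request, 3);  // kick off, do not block
            co.phase = 1;
            return true;                        // yield to the game loop
        case 1:
            if (!path_ready(co.request)) {
                ++co.frames_waited;
                return true;                    // still pending, yield again
            }
            co.phase = 2;                       // result is in, carry on
            return false;
        default:
            return false;
    }
}
```

The coroutine never suspends inside the pathfinder. It only ever suspends at its own top level, which is exactly the restriction the stackless model imposes.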

The common implementations of stackless coroutines in C without language support rely on the same underlying trick: storing a line number and jumping back to it. Protothreads and similar libraries use Duff's Device, which looks like this:

// Duff's Device approach (protothreads-style)
void update(State* s) {
    switch(s->line) {
        case 0:
        // ... code
        // Both __LINE__ uses must sit on the same source line so the values match.
        s->line = __LINE__; return; case __LINE__:;
        // continues...
    }
}
// Works, but: impossible to read in a debugger,
// local vars inside the switch get weird scoping,
// and stepping through it in GDB is maddening.

I don't use the bare version, with state scattered across locals, for anything beyond toy examples. The approach I'm walking through here keeps the switch-based resume but puts every piece of state in an explicit struct, so it stays named, typed, and visible in any debugger. You can pause a running game, inspect patrol_state.waypointIdx, and know exactly what the AI was doing. With bare Duff's Device your debugger shows you a switch statement with numeric labels and local variables that may or may not be in scope depending on which compiler you're using. That debugging story alone is worth the slightly more verbose setup.

The Core Mechanism: A State Struct + A Resume Function

The thing that trips people up when they first see this pattern is how simple the underlying idea is. There's no heap allocation, no virtual dispatch, no scheduler thread. A stackless coroutine is literally a struct that remembers where it left off, and a function that jumps back to that spot. That's the whole trick.

Every coroutine you write will have two things: a data struct containing whatever "local" state survives across yields (current waypoint index, elapsed time, a cached pointer, whatever the coroutine actually needs), and a resume() function with a big switch at the top that dispatches to the right point in the function body. The "program counter" is just an int stored in the struct — typically called _pc or line. When the coroutine first runs, it's 0. Every time you hit a yield point, you save the current line number and return. Next call, the switch jumps straight back to that line.

The classic macro set that makes this ergonomic looks like this — and I'd argue these 20-ish lines are doing more work than most game engine features ten times their size:

// coroutine.h — the entire stackless coroutine machinery

struct Coroutine {
    int _pc = 0;       // program counter: which yield point to resume at
    bool done = false; // set true when coroutine reaches COROUTINE_END
};

// Open the switch that jumps back into the body at the saved program counter.
#define COROUTINE_BEGIN(co) \
    switch ((co)._pc) { case 0:

// Store __LINE__ in _pc and return; the case label planted right after the
// return is where the next resume continues, i.e. the statement after YIELD.
#define YIELD(co) \
    do { \
        (co)._pc = __LINE__; return; \
        case __LINE__:; \
    } while (0)

// Mark completion. _pc resets to 0, so resuming again would restart the body;
// callers should check done first.
#define COROUTINE_END(co) \
    (co).done = true; \
    (co)._pc  = 0; \
    } // closes the switch

// Usage sketch — your coroutine state struct embeds Coroutine:
// struct PatrolCo : Coroutine { int waypointIdx = 0; float timer = 0; };
// void patrol_resume(PatrolCo& co, float dt) {
//     COROUTINE_BEGIN(co);
//     ... code ...
//     YIELD(co);
//     ... more code ...
//     COROUTINE_END(co);
// }

Why __LINE__? Because it gives a distinct integer for every source line, which is exactly what you need for a switch case label. The trick in YIELD is that the case __LINE__: appears after the return, inside the do…while(0) block. The C++ switch statement doesn't care that the label is inside a nested block: a case label can sit anywhere inside the switch body, and the switch simply jumps to it. This is Duff's Device territory, and yes, it's standard C++, as long as the jump doesn't skip over the initialization of a local variable.
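To watch the jump happen, here's a minimal sketch that reuses the three macros from this section on a toy coroutine. StepsCo and steps_resume are names I made up for the demo; the log vector just records which chunk of the body ran on each resume.

```cpp
#include <vector>

struct Coroutine {
    int _pc = 0;
    bool done = false;
};

#define COROUTINE_BEGIN(co) switch ((co)._pc) { case 0:
#define YIELD(co) \
    do { (co)._pc = __LINE__; return; case __LINE__:; } while (0)
#define COROUTINE_END(co) (co).done = true; (co)._pc = 0; }

// Hypothetical demo state: records which steps ran, in order.
struct StepsCo : Coroutine {
    std::vector<int> log;
};

inline void steps_resume(StepsCo& co) {
    COROUTINE_BEGIN(co);
    co.log.push_back(1);   // runs on the first resume only
    YIELD(co);
    co.log.push_back(2);   // runs on the second resume only
    YIELD(co);
    co.log.push_back(3);   // runs on the third resume, then the coroutine ends
    COROUTINE_END(co);
}
```

Each call to steps_resume executes exactly one chunk and then returns, so after three calls the log holds 1, 2, 3 and done is set.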

The one real gotcha: two YIELD calls on the same source line will expand to duplicate case labels and your compiler will yell at you. This never happens in normal code, but it will happen if you write a macro that internally calls YIELD twice and you invoke that macro on one line, or if you're doing something weird with semicolons. The fix is trivial — just put each YIELD on its own line (you should be doing this anyway for readability). If you're wrapping YIELD inside a higher-level macro like WAIT_SECONDS, verify the expansion always results in at most one YIELD per macro invocation. Here's a concrete example of the breakage and the fix:

// BROKEN: both YIELDs expand to case __LINE__ with the *same* number
void bad(MyCo& co) {
    COROUTINE_BEGIN(co);
    YIELD(co); YIELD(co);  // same line → duplicate case labels → compile error
    COROUTINE_END(co);
}

// FIXED: one yield per line, no ambiguity
void good(MyCo& co) {
    COROUTINE_BEGIN(co);
    YIELD(co);
    YIELD(co);  // __LINE__ is different now — two distinct case labels
    COROUTINE_END(co);
}

One honest limitation to name up front: a local variable declared in the coroutine body does not survive a yield. Each resume is a fresh function call, and the switch jumps straight past the declaration; in C++ the compiler outright rejects a case label that jumps into the scope of an initialized local. Anything that needs to persist across a yield must live in the coroutine struct itself, which is actually fine for game AI, since you'd want that data visible for debugging anyway. This constraint forces a clean separation between transient scratch values and real coroutine state, and I've come to prefer it over the implicit stack-held locals you get with stackful coroutines.

Building the Scheduler: ~60 Lines That Run Everything

The scheduler is where the magic either holds together or falls apart. I've seen people build elaborate priority queues and dependency graphs here — resist that urge. The simplest version that covers real game logic is a flat array and a loop. Here's the core:

class CoroutineScheduler {
    static constexpr size_t MAX_COROUTINES = 64;
    Coroutine* slots[MAX_COROUTINES] = {};
    size_t count = 0;

public:
    void add(Coroutine* c) {
        assert(count < MAX_COROUTINES);
        slots[count++] = c;
    }

    void update() {
        size_t i = 0;
        while (i < count) {
            bool still_alive = slots[i]->resume();
            if (!still_alive) {
                // Swap with last to avoid shifting — order doesn't matter here
                slots[i] = slots[--count];
                slots[count] = nullptr;
            } else {
                ++i;
            }
        }
    }
};

That swap-with-last trick is important. Shifting elements on removal is O(n) per removal and introduces bugs when you remove multiple coroutines in the same frame — the swap makes removal O(1) and you don't have to think about it. The thing that caught me off guard the first time I wrote this: if you increment i after a swap, you'll skip the coroutine that just moved into slots[i]. Only advance i when you didn't remove.
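Here's a small self-contained sketch of that removal loop you can poke at in isolation. CountdownTask is a hypothetical stand-in for a coroutine that finishes after a fixed number of resumes; the resumes counter is only there to prove nobody gets skipped or run twice.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a coroutine: finishes after `frames_left` resumes.
struct CountdownTask {
    int frames_left;
    int resumes = 0;
    bool resume() { ++resumes; return --frames_left > 0; }  // true = still alive
};

struct TinyScheduler {
    static constexpr std::size_t MAX = 8;
    CountdownTask* slots[MAX] = {};
    std::size_t count = 0;

    void add(CountdownTask* t) {
        assert(count < MAX);
        slots[count++] = t;
    }

    void update() {
        std::size_t i = 0;
        while (i < count) {
            if (slots[i]->resume()) {
                ++i;                        // survivor: move on
            } else {
                slots[i] = slots[--count];  // removed: the last task now sits at i,
                slots[count] = nullptr;     // so do NOT advance i this iteration
            }
        }
    }
};
```

With three tasks lasting 1, 3, and 2 frames, the first update removes the first task and still resumes the task that got swapped into its slot, so every live task runs exactly once per frame.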

For memory, you have two honest choices. The static pool version allocates a fixed arena at startup and placement-news coroutines into it — zero heap involvement, deterministic, perfect for consoles and embedded:

// Pool approach — fixed max, but zero fragmentation, zero allocation in hot path
alignas(alignof(std::max_align_t)) char pool[MAX_COROUTINES * sizeof(MyCoroutine)];
size_t pool_used = 0;

MyCoroutine* alloc_coroutine() {
    assert(pool_used < MAX_COROUTINES);
    return new (pool + pool_used++ * sizeof(MyCoroutine)) MyCoroutine();
}

The std::vector<std::unique_ptr<Coroutine>> version is about 5 lines and handles mixed coroutine types easily. Use it for tools, editors, or anything running on PC where you genuinely don't know the max count. My honest take: if you're writing gameplay AI or cutscene logic, hit the pool. If you're writing an editor scripting system, the vector is fine and you'll iterate faster. Don't pool-optimize a tool that runs at 60 Hz with 8 active coroutines.

The two yield variants that cover real cases are frame-wait and condition-wait. Wire them into a YieldRequest struct your coroutine fills out before suspending:

struct YieldRequest {
    enum class Type { FrameWait, Condition } type;
    int frames_remaining;                        // for FrameWait
    std::function<bool()> condition;             // for Condition — evaluated each frame
};

// Inside update(), before calling resume(), check if the coroutine is still waiting.
// Assumes Coroutine carries a YieldRequest member named pending_yield.
bool CoroutineScheduler::should_resume(Coroutine* c) {
    auto& req = c->pending_yield;
    if (req.type == YieldRequest::Type::FrameWait) {
        if (--req.frames_remaining > 0) return false;
    } else if (req.type == YieldRequest::Type::Condition) {
        if (!req.condition()) return false;
    }
    return true;
}

The std::function in YieldRequest may heap-allocate once your lambda's captures outgrow the implementation's small-object buffer. The threshold is implementation-defined and differs between standard libraries, but it is typically only a few pointers' worth of space, so don't count on anything bigger than that staying inline. For a pool-based system where you want zero allocation, replace it with a raw function pointer plus a void* userdata field. Slightly more verbose at the call site, but you keep full control. The frame-wait path is allocation-free regardless: it's just a counter decrement, and that's fast enough that you don't need to batch it.
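A rough sketch of the function-pointer-plus-userdata variant, under the assumption that the request struct is free to change shape. The Door type, door_is_open, wait_until, and wait_frames are all hypothetical helpers for the demo, and should_resume is written as a free function here rather than a scheduler method.

```cpp
// Allocation-free alternative to std::function<bool()>.
struct YieldRequest {
    enum class Type { None, FrameWait, Condition };
    Type type = Type::None;
    int frames_remaining = 0;                     // for FrameWait
    bool (*condition)(void* userdata) = nullptr;  // for Condition
    void* userdata = nullptr;                     // passed back to condition
};

inline YieldRequest wait_frames(int n) {
    YieldRequest r;
    r.type = YieldRequest::Type::FrameWait;
    r.frames_remaining = n;
    return r;
}

inline YieldRequest wait_until(bool (*fn)(void*), void* userdata) {
    YieldRequest r;
    r.type = YieldRequest::Type::Condition;
    r.condition = fn;
    r.userdata = userdata;
    return r;
}

// Same decision logic as the std::function version, minus the allocation.
inline bool should_resume(YieldRequest& req) {
    switch (req.type) {
        case YieldRequest::Type::FrameWait:
            return --req.frames_remaining <= 0;
        case YieldRequest::Type::Condition:
            return req.condition(req.userdata);
        case YieldRequest::Type::None:
            break;
    }
    return true;
}

// Hypothetical game-side state and predicate.
struct Door { bool open = false; };

inline bool door_is_open(void* userdata) {
    return static_cast<Door*>(userdata)->open;
}
```

At the call site this reads as wait_until(&door_is_open, &door). The userdata pointer has to outlive the wait, so in a real game you'd more likely pass something stable rather than a pointer to an object that can be destroyed mid-wait.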

Writing Your First Real Coroutine: A Door-Opening Sequence

The most convincing argument for coroutines isn't a benchmark — it's reading code that does five sequential things and actually looking sequential. Here's a door-opening sequence that covers everything a real game needs: movement, waiting, animation gating, and triggering a side effect.

// door_sequence.h — the coroutine state lives here as a flat struct
struct DoorCoroutine {
    CoroutineState coro;      // your ~200-line machinery lives here
    EntityID       actor;     // NOT a pointer — we'll explain why below
    EntityID       door;
    Vec2           target_pos;
    int            wait_timer;
    int            anim_frame_start;
};

// The coroutine body — reads like a script
COROUTINE_FUNC(door_open_sequence, DoorCoroutine* self) {
    CORO_BEGIN(self->coro);

    // Walk to door. Yields every frame until actor is close enough.
    // GetEntity() does a live lookup — safe even if world ticks between yields.
    YIELD_WHILE(
        Vec2Distance(GetEntity(self->actor)->pos, self->target_pos) > 4.0f
        && MoveToward(self->actor, self->target_pos, ACTOR_SPEED)
    );

    // Wait 60 frames (1 second at 60 Hz) — no timer variable on the caller side
    self->wait_timer = 0;
    YIELD_WHILE(self->wait_timer++ < 60);

    // Kick off the door-open animation, then yield until it finishes
    PlayAnimation(self->door, ANIM_DOOR_OPEN);
    self->anim_frame_start = GetEntity(self->door)->anim_frame;
    YIELD_WHILE(GetEntity(self->door)->anim_playing);

    // Fire dialogue — by this point we KNOW the door is open
    TriggerDialogue(self->actor, DIALOGUE_DOOR_OPENED);

    CORO_END(self->coro);
}

Now look at the state-machine equivalent I ripped out when I rewrote our AI system. Same behavior, but managing it by hand means carrying all of this in your head at once:

// The before: enum + 3 timers + 2 flags = death by a thousand variables
typedef enum {
    DOOR_STATE_WALKING,
    DOOR_STATE_WAITING,
    DOOR_STATE_ANIMATING,
    DOOR_STATE_DONE
} DoorState;

typedef struct {
    DoorState state;
    int       wait_frames_elapsed;   // timer #1
    int       anim_frame_start;      // timer #2 (not even a timer, but used as one)
    float     last_distance;         // timer #3 — debounce for arrival detection
    bool      animation_triggered;   // flag #1
    bool      dialogue_fired;        // flag #2 — because DONE state can re-run by accident
} DoorStateMachine;

// Update function: every new feature needs another case, another flag,
// another "oh wait, what happens if state == ANIMATING but door entity is gone?"
void DoorSM_Update(DoorStateMachine* sm, Entity* actor, Entity* door) {
    switch (sm->state) {
        case DOOR_STATE_WALKING:
            if (Vec2Distance(actor->pos, door->trigger_pos) <= 4.0f)
                sm->state = DOOR_STATE_WAITING;
            else
                MoveToward(actor, door->trigger_pos, ACTOR_SPEED);
            break;
        case DOOR_STATE_WAITING:
            if (++sm->wait_frames_elapsed >= 60)
                sm->state = DOOR_STATE_ANIMATING;
            break;
        case DOOR_STATE_ANIMATING:
            if (!sm->animation_triggered) {
                PlayAnimation(door, ANIM_DOOR_OPEN);
                sm->animation_triggered = true;
            }
            if (!door->anim_playing) {
                if (!sm->dialogue_fired) {
                    TriggerDialogue(actor, DIALOGUE_DOOR_OPENED);
                    sm->dialogue_fired = true;
                }
                sm->state = DOOR_STATE_DONE;
            }
            break;
        case DOOR_STATE_DONE: break;
    }
}

The state machine has 4 enum values, 3 timing variables, and 2 guard booleans — and this is a simple sequence. Add a "wait for player to be looking at door" step and you're adding another state, probably another flag, and debugging why the dialogue fires twice on fast machines. The coroutine version just gets a new YIELD_WHILE line.

Pitfall #1: storing raw pointers to game objects. The thing that bit me immediately was this pattern:

// DANGEROUS — do not store raw pointers in coroutine state
struct DoorCoroutine {
    CoroutineState coro;
    Entity*        actor;   // ❌ actor could be freed between any two yield points
    Entity*        door;    // ❌ same problem
};

Your coroutine suspends across frames. Between YIELD_WHILE on frame 100 and resumption on frame 101, the player could have walked into a kill zone, the entity pool could have recycled that slot, and your pointer now points at a different enemy wearing the same memory address. The fix is to store IDs and resolve them through a live lookup each time:

// SAFE — ID is stable; lookup returns NULL if entity was destroyed
struct DoorCoroutine {
    CoroutineState coro;
    EntityID       actor;   // ✅ just a uint32_t
    EntityID       door;    // ✅
};

// In the coroutine body, always go through the registry:
Entity* a = GetEntity(self->actor);
if (!a) { CORO_ABORT(self->coro); }  // entity gone — bail cleanly

Add a null-check right after every YIELD_WHILE that depends on an entity. It's one line, and it means your coroutine can survive entity death gracefully instead of crashing into freed memory three frames later.
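For the lookup to actually return null after a slot is recycled, the ID has to encode more than a slot index. One common way to get that is a generation counter per slot. The sketch below is my own minimal take on it, not the registry from our codebase: the 16/16 bit split, EntityRegistry, and its method names are all assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entity { float x = 0.0f; };

// Hypothetical ID layout: low 16 bits = slot index, high 16 bits = generation.
using EntityID = std::uint32_t;

class EntityRegistry {
    struct Slot {
        Entity entity;
        std::uint16_t generation = 0;
        bool alive = false;
    };
    std::vector<Slot> slots_;

    static EntityID make_id(std::size_t index, std::uint16_t gen) {
        return (static_cast<EntityID>(gen) << 16) |
               static_cast<EntityID>(index & 0xFFFFu);
    }

    Slot* find(EntityID id) {
        const std::size_t index = id & 0xFFFFu;
        const std::uint16_t gen = static_cast<std::uint16_t>(id >> 16);
        if (index >= slots_.size()) return nullptr;
        Slot& s = slots_[index];
        return (s.alive && s.generation == gen) ? &s : nullptr;
    }

public:
    EntityID create() {
        std::size_t index = 0;
        while (index < slots_.size() && slots_[index].alive) ++index;
        if (index == slots_.size()) slots_.push_back(Slot{});
        Slot& s = slots_[index];
        s.alive = true;
        s.entity = Entity{};
        return make_id(index, s.generation);
    }

    void destroy(EntityID id) {
        if (Slot* s = find(id)) {
            s->alive = false;
            ++s->generation;  // invalidates every ID handed out for this slot
        }
    }

    // Returns nullptr if the entity was destroyed, even if its slot was reused.
    Entity* get(EntityID id) {
        Slot* s = find(id);
        return s ? &s->entity : nullptr;
    }
};
```

The pointer returned by get() is only good until the next yield. Look it up again after every resume, because the registry may have grown or recycled the slot in between.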

Pitfall #2: yielding inside a loop that mutates shared state. YIELD_WHILE doesn't pause a loop — it exits the coroutine function entirely and restarts from the same point next frame. If you write something like this:

// This looks innocent but has a subtle bug
for (int i = 0; i < enemy_count; i++) {
    DamageEnemy(enemies[i], 10);
    YIELD_WHILE(enemies[i]->hurt_anim_playing);  // ❌ i is on the C stack, not in coroutine state
    // On resume: the switch jumps straight into the loop body, skipping "int i = 0".
    // In C, i holds garbage; in C++, the compiler rejects the jump past the initializer.
}

The coroutine's jump table re-enters at the YIELD_WHILE line, but the for loop's counter i is a local variable, and whatever value it held on the previous call is gone. You need to hoist the iterator into the struct:

// Safe version: loop counter lives in coroutine state, not on the stack.
// The for-init is a plain assignment, so the resume jump skips no declaration.
for (self->enemy_index = 0; self->enemy_index < enemy_count; self->enemy_index++) {
    DamageEnemy(enemies[self->enemy_index], 10);  // applied once per enemy
    // wait for this enemy's hurt animation, then advance to the next one
    YIELD_WHILE(enemies[self->enemy_index]->hurt_anim_playing);
}

The general rule: anything that needs to survive a yield point must live in the struct, not on the C stack. Local variables declared inside the coroutine body are fine for single-frame computations, but the moment you put a yield point after them, they're gone on the next resume. This is the sharpest edge on the entire technique: a C compiler accepts the broken version without complaint and you just get wrong behavior, and a C++ compiler only catches the cases where the jump skips an initializer.
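Here's a runnable sketch of a yield inside a loop with the counter hoisted into the struct. The CORO_BEGIN, YIELD_WHILE, and CORO_END definitions are hypothetical minimal versions I wrote for this demo (this YIELD_WHILE assumes a variable named self is in scope), and Enemy with its hurt_frames countdown is a stand-in for a real hurt animation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal versions of the macros used in this section.
struct CoroutineState { int line = 0; bool done = false; };

#define CORO_BEGIN(c) switch ((c).line) { case 0:
#define YIELD_WHILE(cond) \
    do { self->coro.line = __LINE__; case __LINE__: if (cond) return; } while (0)
#define CORO_END(c) } (c).done = true; (c).line = -1

struct Enemy { int hp = 30; int hurt_frames = 0; };

struct DamageAllCo {
    CoroutineState coro;
    std::vector<Enemy>* enemies = nullptr;  // container pointer, kept simple for the demo
    std::size_t enemy_index = 0;            // loop counter lives here, not on the stack
};

inline void damage_all(DamageAllCo* self) {
    CORO_BEGIN(self->coro);
    for (self->enemy_index = 0;
         self->enemy_index < self->enemies->size();
         ++self->enemy_index) {
        (*self->enemies)[self->enemy_index].hp -= 10;        // once per enemy
        (*self->enemies)[self->enemy_index].hurt_frames = 2; // simulated animation
        YIELD_WHILE((*self->enemies)[self->enemy_index].hurt_frames > 0);
    }
    CORO_END(self->coro);
}
```

Because the for-init is a plain assignment rather than a declaration, the resume jump lands inside the loop body without skipping any initializer, and the counter picks up exactly where it left off.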

Spawning Child Coroutines and Waiting on Them

The parent-waits-on-child pattern is where stackless coroutines go from "neat trick" to actually replacing your state machine hierarchy. The mental model is simple: spawn a child coroutine into the scheduler, hang on to its ID, then sit in a YIELD_WHILE loop checking if it's still alive. The parent suspends itself each frame, the scheduler runs both parent and child, and eventually the child finishes and the parent resumes. No threads, no callbacks, no manually threading a "finished" flag through five layers of state.

Here's what the scheduler extension looks like. You need two things: a monotonically incrementing ID assigned at spawn time, and an is_alive(id) query that doesn't blow up if the ID has already been cleaned up.

// Extend your Coroutine struct
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct Coroutine {
    std::function<bool()> resume;  // returns true while there is more work to do
    bool finished = false;
    uint32_t id = 0;       // assigned at spawn, never reused
};

class Scheduler {
    std::vector<Coroutine> coros;
    uint32_t next_id = 1;  // 0 reserved as "invalid"

public:
    uint32_t spawn(std::function<bool()> fn) {
        uint32_t id = next_id++;
        coros.push_back({ std::move(fn), false, id });
        return id;  // caller stores this to poll later
    }

    bool is_alive(uint32_t id) const {
        for (auto& c : coros)
            if (c.id == id) return !c.finished;
        return false;  // already cleaned up, treat as done
    }

    void tick() {
        for (auto& c : coros)
            if (!c.finished) c.finished = !c.resume();  // record completion
        // erase finished entries after the tick loop (std::erase_if needs C++20)
        std::erase_if(coros, [](auto& c){ return c.finished; });
    }
};

That's the whole extension — roughly 30 lines if you include the ID field and the spawn/is_alive pair. Notice that is_alive returns false when the ID isn't found at all. That's intentional: the scheduler erases finished coroutines after each tick, so "not found" and "found but finished" are both "safe to proceed." The parent's YIELD_WHILE naturally exits in both cases.

The concrete gamedev use case that made me reach for this pattern: two AI movement coroutines running in parallel — one pathfinds to a flanking position, one plays a cover animation — and neither the story system nor the dialogue trigger should fire until both are done. With callbacks you'd be counting completed callbacks and praying they don't race. With this pattern it's explicit and readable:

COROUTINE_BEGIN(story_beat)
    // spawn both movement sub-tasks simultaneously; the IDs are stored in the
    // coroutine struct because locals would not survive the yield below
    story_beat.flank_id = scheduler.spawn([&]{ COROUTINE_BEGIN(flank_move) /* ... */ COROUTINE_END });
    story_beat.cover_id = scheduler.spawn([&]{ COROUTINE_BEGIN(cover_anim) /* ... */ COROUTINE_END });

    // parent suspends each frame until BOTH children are gone
    YIELD_WHILE(scheduler.is_alive(story_beat.flank_id) || scheduler.is_alive(story_beat.cover_id));

    // here it's safe: both finished this frame or earlier
    trigger_dialogue("ambush_setup_complete");
COROUTINE_END

The gotcha I hit the first time: I was erasing finished coroutines inside the tick loop, which invalidated iterators mid-iteration and occasionally skipped a coroutine's final resume. The fix is what you see above — mark finished during the loop, erase after. Also watch out for spawning during tick(): if your parent coroutine's resume function calls scheduler.spawn(), you're modifying coros while iterating it. Swap your loop to index-based iteration or snapshot the size at tick start and only iterate up to that index.
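A sketch of the snapshot-the-size variant, with two assumptions of mine layered on top: tasks are held through std::unique_ptr so a callable that is currently executing doesn't move if the vector reallocates during a nested spawn(), and the callable returns true while it still has work to do. The Task struct and size() are names invented for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

class Scheduler {
    struct Task {
        std::function<bool()> resume;  // true = still running
        bool finished = false;
        std::uint32_t id = 0;
    };
    std::vector<std::unique_ptr<Task>> tasks_;
    std::uint32_t next_id_ = 1;        // 0 reserved as "invalid"

public:
    std::uint32_t spawn(std::function<bool()> fn) {
        auto task = std::make_unique<Task>();
        task->resume = std::move(fn);
        const std::uint32_t id = next_id_++;
        task->id = id;
        tasks_.push_back(std::move(task));
        return id;
    }

    bool is_alive(std::uint32_t id) const {
        for (const auto& t : tasks_)
            if (t->id == id) return !t->finished;
        return false;                  // already cleaned up, treat as done
    }

    std::size_t size() const { return tasks_.size(); }

    void tick() {
        // Snapshot: anything spawned during this tick first runs next tick.
        const std::size_t n = tasks_.size();
        for (std::size_t i = 0; i < n; ++i) {
            Task* t = tasks_[i].get(); // pointee stays put even if tasks_ grows
            if (!t->finished) t->finished = !t->resume();
        }
        // Mark during the loop, erase afterwards.
        tasks_.erase(std::remove_if(tasks_.begin(), tasks_.end(),
                         [](const std::unique_ptr<Task>& t) { return t->finished; }),
                     tasks_.end());
    }
};
```

A parent that spawns a child from inside its own resume is safe here: the child lands past the snapshot index and gets its first resume on the following tick.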

Keep the hierarchy shallow — two levels max in practice. A parent coroutine spawning children is fine. A child spawning grandchildren that spawn great-grandchildren will wreck you during debugging. You can't step through it in a debugger in any meaningful way, the ID lifetimes get confusing, and you'll end up with zombie IDs when a middle-layer coroutine gets killed before its children finish. The pattern works beautifully for "orchestrator spawns workers." It gets painful the moment you need to cancel a subtree mid-flight. If you genuinely need cancellation and deep nesting, that's the point where you want a proper coroutine library with structured concurrency semantics — but for most game AI and cutscene systems, flat-plus-one-level covers 95% of what you actually need.

The Full ~200-Line Implementation (Annotated)

The thing that surprised me most when I first put this together: the entire mechanism fits in a single header you can audit in five minutes. There's no magic, no compiler extensions, no co_await machinery. The suspension point trick is Duff's Device — the same idea Simon Tatham documented back in 2000, but applied to game AI instead of protocol parsers. Here's the full annotated implementation, broken into logical blocks you can copy incrementally.

The Macros Header (~30 lines)

These four macros are the entire coroutine protocol. Everything else is just C++ plumbing around them.

// coroutine_macros.hpp
// No include guards needed if you use #pragma once — but both work fine.
#pragma once

// CR_BEGIN / CR_END wrap the body of every coroutine's update() method.
// They open and close the switch statement that restores execution position.
#define CR_BEGIN  switch(_line) { case 0:
#define CR_END    } _line = -1; return false;

// CR_YIELD saves the current line, returns true (meaning "I'm still running"),
// and on the NEXT call to update(), jumps back here via the switch.
// The __LINE__ trick is why each yield point must be on its own source line.
#define CR_YIELD  do { _line = __LINE__; return true; case __LINE__:; } while(0)

// CR_WAIT suspends until a condition becomes true.
// Evaluates the condition every frame — zero heap allocation.
#define CR_WAIT(cond) do { _line = __LINE__; case __LINE__: if(!(cond)) return true; } while(0)

The do { ... } while(0) wrapper on CR_YIELD isn't pedantry: it makes the macro safe inside an if without braces. Drop it and if (x) CR_YIELD; guards only the first statement of the expansion, so the return runs unconditionally and you get a silent logic bug rather than a helpful error. Also: __LINE__ must be unique per yield point, which means you can't put two yields on the same line. That's not a real constraint in practice, but it will produce a duplicate-case compiler error if you accidentally do it, which is actually a helpful failure mode.
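A quick way to convince yourself the wrapper does its job is a coroutine that yields conditionally without braces. MaybePause is a throwaway struct I made up for this; the macros are the same three definitions from the header above.

```cpp
#define CR_BEGIN  switch(_line) { case 0:
#define CR_END    } _line = -1; return false;
#define CR_YIELD  do { _line = __LINE__; return true; case __LINE__:; } while(0)

// Hypothetical demo coroutine: pauses for one frame only when asked to.
struct MaybePause {
    int  _line = 0;
    bool pause_requested = false;
    int  steps = 0;

    bool update() {
        CR_BEGIN;
        ++steps;
        if (pause_requested) CR_YIELD;   // braces omitted on purpose
        ++steps;
        CR_END;
    }
};
```

With pause_requested left false, a single update() call runs both increments and finishes. With it set to true, the first call stops after one increment and the second call picks up right after the yield.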

Coroutine Base Class (~20 lines)

// coroutine.hpp
#pragma once
#include "coroutine_macros.hpp"

class Coroutine {
public:
    int  _line  = 0;   // execution cursor — not "private" so macros can touch it
    bool _done  = false;

    // Returns true while running, false when finished.
    // Override this in subclasses with CR_BEGIN / CR_END.
    virtual bool update() = 0;

    void reset() { _line = 0; _done = false; }

    bool done() const { return _done; }

    virtual ~Coroutine() = default;
};

I made _line public so the macros don't need friend declarations everywhere. If that offends your encapsulation sensibilities, you can template the base class and use CRTP to give each subclass access, but for a game where these live in a single translation unit, the tradeoff isn't worth it. The virtual destructor matters if you're ever storing these by base pointer (which the scheduler does): deleting a derived object through a base pointer without one is undefined behavior, and GCC and Clang can flag it with -Wdelete-non-virtual-dtor.

CoroutineScheduler (~60 lines)

// coroutine_scheduler.hpp
#pragma once
#include "coroutine.hpp"
#include <vector>   // the ONE STL include — see note below

class CoroutineScheduler {
public:
    // Takes ownership via raw pointer. Use an arena allocator in prod.
    void add(Coroutine* cr) {
        _pending.push_back(cr);
    }

    // Call once per frame from your game loop.
    void tick() {
        // Drain pending into active to avoid iterator invalidation
        // when a coroutine spawns another coroutine during its own update().
        for (auto* cr : _pending)
            _active.push_back(cr);
        _pending.clear();

        size_t write = 0;
        for (size_t i = 0; i < _active.size(); ++i) {
            Coroutine* cr = _active[i];
            bool still_running = cr->update();
            if (still_running) {
                _active[write++] = cr;  // compact in-place, no allocation
            } else {
                cr->_done = true;
                delete cr;              // swap for pool.release(cr) on bare metal
            }
        }
        _active.resize(write);
    }

    size_t active_count() const { return _active.size(); }

    // Nukes everything — useful on scene transitions.
    void clear() {
        for (auto* cr : _active)  delete cr;
        for (auto* cr : _pending) delete cr;
        _active.clear();
        _pending.clear();
    }

    ~CoroutineScheduler() { clear(); }

private:
    std::vector<Coroutine*> _active;
    std::vector<Coroutine*> _pending;
};

The split _active/_pending buffer is the part most toy implementations get wrong. If you add a coroutine from inside another coroutine's update() call and you're iterating _active at the same time, you either corrupt the iterator or silently skip the new coroutine. The pending-flush pattern handles both cases cleanly. The in-place compaction (write index trick) avoids a temporary allocation on every tick — on a title that runs 500+ active coroutines this actually shows up in a profiler.

Removing the STL include on bare metal: Replace std::vector with a fixed-capacity intrusive list or a flat array with a sentinel. Something like this is enough for most embedded or console targets:

// Replace std::vector with a fixed-size array if malloc is unavailable.
// Tune MAX_COROUTINES to your worst-case frame budget.
static constexpr size_t MAX_COROUTINES = 256;
Coroutine* _active[MAX_COROUTINES];
size_t     _active_count = 0;

You lose the dynamic growth, but on a bare-metal target you probably want that constraint explicit anyway: a coroutine budget overrun should be a hard assertion, not a surprise heap allocation that exhausts or fragments memory three frames later.
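As a rough illustration of the fixed-array variant, the sketch below fills in the add() and tick() logic around such an array. It is an assumption-laden example, not the article's code: Task, FixedScheduler, must_add and CountDown are hypothetical names, the scheduler is non-owning (the caller keeps the objects alive), and the capacity is a template parameter instead of a MAX_COROUTINES constant.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the article's Coroutine base class.
struct Task {
    virtual ~Task() {}
    virtual bool update() = 0;   // returns true while still running
};

// Non-owning, fixed-capacity scheduler: no heap use, explicit budget.
template <std::size_t Capacity>
class FixedScheduler {
public:
    // Returns false when the budget is exhausted so the caller can react.
    bool add(Task* t) {
        if (_count == Capacity) return false;
        _slots[_count++] = t;
        return true;
    }

    void tick() {
        const std::size_t end = _count;   // tasks added mid-tick start next tick
        std::size_t write = 0;
        for (std::size_t i = 0; i < end; ++i) {
            Task* t = _slots[i];
            if (t->update()) _slots[write++] = t;
        }
        // Slide down anything that was appended while we were iterating.
        for (std::size_t i = end; i < _count; ++i) _slots[write++] = _slots[i];
        _count = write;
    }

    std::size_t count() const { return _count; }

private:
    Task*       _slots[Capacity] = {};
    std::size_t _count = 0;
};

// Treat a budget overrun as a programming error in debug builds.
template <std::size_t Capacity>
void must_add(FixedScheduler<Capacity>& s, Task* t) {
    const bool ok = s.add(t);
    assert(ok && "coroutine budget exceeded");
    (void)ok;
}

// Example task: finishes after n updates.
struct CountDown : Task {
    int n;
    explicit CountDown(int k) : n(k) {}
    bool update() override { return --n > 0; }
};
```

Returning bool from add() and asserting in a wrapper keeps the overrun visible in debug builds without making release builds abort; whether that trade-off suits your target is a judgment call.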

Example Coroutines (~60 lines)

// example_coroutines.hpp
#pragma once
#include "coroutine.hpp"
#include "coroutine_scheduler.hpp"  // SpawnAndWait takes a CoroutineScheduler&

// Counts elapsed frames via an external counter reference.
// Shows CR_WAIT with a stateful condition.
class WaitFrames : public Coroutine {
public:
    WaitFrames(const int& frame_counter, int wait_count)
        : _start(frame_counter), _wait(wait_count), _fc(frame_counter) {}

    bool update() override {
        CR_BEGIN;
        _start = _fc;
        CR_WAIT(_fc - _start >= _wait);
        // arrives here exactly once, the frame the wait expires
        return false;
        CR_END;
    }

private:
    int _start;
    int _wait;
    const int& _fc;
};

// A patrol AI: moves to A, waits 60 frames, moves to B, repeats.
// Demonstrates CR_YIELD inside a loop — this is the real power.
class PatrolAgent : public Coroutine {
public:
    PatrolAgent(float& x, float target_a, float target_b)
        : _x(x), _a(target_a), _b(target_b) {}

    bool update() override {
        CR_BEGIN;
        while(true) {
            _x = _a;
            CR_YIELD;              // pause exactly one frame (CR_WAIT(false) would wait forever)
            _frames = 0;
            CR_WAIT(_frames++ >= 60);
            _x = _b;
            CR_YIELD;
            _frames = 0;
            CR_WAIT(_frames++ >= 60);
        }
        CR_END;
    }

private:
    float& _x;
    float  _a, _b;
    int    _frames = 0;
};

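// Child that reports completion through a flag owned by its parent.
// The scheduler deletes a coroutine the moment it finishes, so the parent
// must never poll the child through a pointer.
class NotifyingWait : public WaitFrames {
public:
    NotifyingWait(const int& fc, int wait_count, bool& finished)
        : WaitFrames(fc, wait_count), _finished(finished) {}

    bool update() override {
        const bool running = WaitFrames::update();
        if (!running) _finished = true;
        return running;
    }

private:
    bool& _finished;
};

// Spawns a child coroutine and waits for it to finish.
// The parent polls its own flag: no callbacks, no futures, no dangling pointers.
// The parent outlives the child because it only finishes after the flag is set.
class SpawnAndWait : public Coroutine {
public:
    SpawnAndWait(CoroutineScheduler& sched, const int& fc)
        : _sched(sched), _fc(fc) {}

    bool update() override {
        CR_BEGIN;
        _sched.add(new NotifyingWait(_fc, 30, _child_done));
        CR_WAIT(_child_done);
        return false;
        CR_END;
    }

private:
    CoroutineScheduler& _sched;
    const int& _fc;
    bool _child_done = false;
};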

Main Game-Loop Integration (~30 lines)

// main.cpp — pseudocode-close-to-real integration
#include "coroutine_scheduler.hpp"
#include "example_coroutines.hpp"
#include <cstdio>

int main() {
    CoroutineScheduler scheduler;
    int  frame   = 0;
    float agent_x = 0.0f;

    // Spawn some work before the loop starts
    scheduler.add(new WaitFrames(frame, 120));
    scheduler.add(new PatrolAgent(agent_x, 0.0f, 100.0f));
    scheduler.add(new SpawnAndWait(scheduler, frame));

    // Your real game loop replaces this loop: an SDL event loop, an engine update callback, whatever.
    // PatrolAgent never finishes, so the demo stops after a fixed number of frames.
    const int max_frames = 300;
    do {
        scheduler.tick();   // step all live coroutines once
        ++frame;

        // frame budget check: if tick() takes >0.5ms you have too many coroutines
        // or one is doing real work inside update(), so keep update() lean
        if (frame % 10 == 0)
            printf("frame %d, active coroutines: %zu\n", frame, scheduler.active_count());
    } while (scheduler.active_count() > 0 && frame < max_frames);

    printf("stopped at frame %d with %zu coroutines still active\n", frame, scheduler.active_count());
    return 0;
}

Compile this with g++ -std=c++11 -O2 -Wall main.cpp -o demo. The -O2 flag matters here: without optimization, the virtual dispatch on update() plus the switch jump costs a measurable amount per coroutine per frame. With -O2, GCC can devirtualize a call when the concrete type is visible at the call site, although calls made through the scheduler's array of Coroutine* generally stay virtual, and the switch typically becomes a jump table or a short compare chain. Conceptually the hot path compiles down to something like:

; Conceptual x86-64 for PatrolAgent::update() at a yield point
; The switch(this->_line) becomes:
mov  eax, DWORD PTR [r

Where This Breaks Down: Be Honest With Yourself
The yield-from-called-function limitation will bite you harder than you expect. You write a clean walk_to() helper, then try to drop a CO_YIELD inside it, and nothing works: the macro expands inside the wrong function scope, so the __LINE__-based case label lands in the helper instead of the coroutine's switch, which usually fails to compile because a case label needs an enclosing switch. Your entire coroutine body must live in a single function. No factoring out sub-behaviors into free functions that themselves yield. The workaround is to either inline everything or use sub-coroutine objects that the parent coroutine manually ticks and checks for completion, which works, but it's extra bookkeeping you have to write yourself.
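The sub-coroutine workaround can be sketched roughly as follows. Everything here is illustrative: the DEMO_* macros are hypothetical Protothreads-style stand-ins (the article's own macros may differ in detail), and WalkTo, Cutscene and run_cutscene are made-up names. The child lives inside the parent as a member, and the parent's wait condition ticks it once per frame.

```cpp
#include <cassert>

// Hypothetical macros for this sketch only.
#define DEMO_BEGIN      switch (_line) { case 0:
#define DEMO_WAIT(cond) do { _line = __LINE__; case __LINE__: \
                             if (!(cond)) return true; } while (0)
#define DEMO_END        } _line = -1; return false

// Sub-behavior with its own resume point: step x toward target, one unit
// per update.
struct WalkTo {
    int  _line = 0;
    int& x;
    int  target;

    WalkTo(int& x_, int target_) : x(x_), target(target_) {}

    bool update() {
        DEMO_BEGIN;
        DEMO_WAIT(step());   // yields until step() reports arrival
        DEMO_END;
    }

private:
    bool step() {
        if (x < target) ++x;
        else if (x > target) --x;
        return x == target;
    }
};

// Parent sequence: walk to the door, then open it.
struct Cutscene {
    int    _line = 0;
    WalkTo walk;              // child is a member, so it cannot dangle
    bool   door_open = false;

    explicit Cutscene(int& x) : walk(x, 3) {}

    bool update() {
        DEMO_BEGIN;
        DEMO_WAIT(!walk.update());   // tick the child; continue once it is done
        door_open = true;
        DEMO_END;
    }
};

// Drives the cutscene to completion and returns how many frames it took.
int run_cutscene(int& x) {
    Cutscene scene(x);
    int frames = 1;
    while (scene.update()) ++frames;
    assert(scene.door_open);
    return frames;
}
```

The bookkeeping cost is visible: each sub-behavior needs its own struct and its own _line, and the parent has to remember to tick it. In exchange, the child can yield as often as it likes without breaking the single-function rule.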

Debugger experience is genuinely rough. You set a breakpoint expecting to land somewhere meaningful, and instead you're sitting at the top of the switch dispatch. GDB or LLDB will show you the switch(_state) line like it's helpful. It's not. The fix I actually use: add a const char* debug_label field to the coroutine struct, update it right before each yield point, and log it in resume():

#include <cstdio>

struct EnemyCoroutine {
    int _state = 0;
    const char* debug_label = "init"; // update this before every CO_YIELD
    void (*tick)(EnemyCoroutine& self) = nullptr; // the coroutine body
    // ... rest of your fields
};

// In your scheduler's resume():
void resume(EnemyCoroutine& co) {
    std::printf("[CO] resuming '%s' from state %d\n", co.debug_label, co._state);
    co.tick(co);
}

// Inside the coroutine body, right before yielding:
self.debug_label = "waiting_for_path";
CO_YIELD(self);


Now your log output actually tells you something. It costs you one pointer per coroutine and a few characters per yield site — worth it every time.

If you're managing 50+ AI agents with branching dialogue trees, reactive behaviors, group tactics, and condition evaluators, this approach is the wrong tool. What you've built is great for linear or mildly branching sequences — patrol loops, scripted events, simple attack routines. Behavior trees handle reactive priority-based decisions better. Lua or Wren give your designers a real scripting language they can own. I've seen teams bolt stackless coroutines onto a project that needed behavior trees and end up with 800-line coroutine functions full of nested if-chains. That's worse than where they started. The threshold is roughly: if your agent's logic needs to react to world state changes mid-sequence and you have more than a handful of those agents, look at BTs before you commit.

Thread safety: there is none here, and that's intentional. The scheduler assumes it's called from one thread — typically your main game update thread or a dedicated AI tick. If you've got a job system and you're tempted to call resume() from worker threads to parallelize AI updates, you need a mutex at minimum, and you'll almost certainly hit the problem where your coroutine reads game state that another thread is mutating. At that point the architecture is wrong, not just the locking. The honest answer is: if you need parallel AI ticks, you want coroutines that operate on isolated data slices, a proper job dependency graph, or you want stackful coroutines (Boost.Context, or fibers on your target platform) where suspension is cheaper to reason about across threads.

Real Projects Using This Pattern (And What They Changed)
Protothreads by Adam Dunkels is the direct ancestor of what most game devs end up building when they roll their own stackless coroutine system. Dunkels published it around 2005 targeting embedded systems — think 8-bit microcontrollers with a few hundred bytes of RAM — and the core trick is pure Duff's Device abuse: a switch statement that jumps back into the middle of a function using saved line numbers. The implementation is around 100 lines of C macros and it runs on hardware where a call stack is a luxury. Reading the Protothreads source even once will make the pattern click in a way no tutorial can. The PT_WAIT_UNTIL and PT_YIELD macros expand to labeled cases inside a switch, and the continuation is just an integer stored in the struct. Same principle, almost identical mechanics to what we're building in C++.

id Software's scripted sequence problem is older than people remember. Before C++20 coroutines and before engines had built-in behavior trees, studios were writing linear "scripts" for NPCs and cutscenes using explicit state machines — a big enum, a switch in the update loop, and a frame counter. Quake-era QuakeC had this baked into the language as a kind of poor man's coroutine: the think function pointer plus nextthink time created a manual continuation. The programmer was storing "resume here next frame" by setting a function pointer, which is stackless coroutine semantics with extra steps. The pattern is genuinely decades old; C++20 didn't invent the idea, it just gave it syntactic sugar.

After shipping a project with the macro-based approach, the thing I changed immediately was ditching the bare YIELD() macro, whose state was buried inside the macro expansion, in favor of macros that take the coroutine object explicitly and store the resume point in a plain _line member. The hidden version looks clean until you're three months in and you need to know which yield point an NPC is stuck at. With the explicit version, you can add a watch on npc.coroutine._line in VS Code's C/C++ extension debug panel and see the exact line number at a glance. With hidden state, the expanded code doesn't map cleanly to what you see in the editor: breakpoints land in odd places and the call stack tells you little. Here's the pattern I settled on:

struct Coroutine {
    int _line = 0;       // which yield point we resume from
    float _timer = 0.f;  // most NPCs need at least one timer
};

// The macros take the coroutine explicitly, so _line is a plain member:
#define CR_BEGIN(co)    switch ((co)._line) { case 0:
#define CR_YIELD(co)    do { (co)._line = __LINE__; return; case __LINE__:; } while (0)
#define CR_END(co)      (co)._line = 0; }

// Placeholders for your own game types and movement call:
struct PatrolState { float point_a; float point_b; };
void move_to(float target);

// Usage stays readable, and _line is a real inspectable member:
void patrol_update(PatrolState& s, Coroutine& co, float dt) {
    CR_BEGIN(co);
    move_to(s.point_a);
    CR_YIELD(co);           // a breakpoint here is hit on the frame that yields
    co._timer = 0.f;
    while (co._timer < 2.f) {
        co._timer += dt;
        CR_YIELD(co);       // the debugger shows _line == this line number
    }
    move_to(s.point_b);
    CR_YIELD(co);
    CR_END(co);
}

The difference in debuggability is significant. Because each yield stores __LINE__ directly into the visible _line member, you can set a conditional breakpoint on co._line == 47 and land exactly where you expect. Before I made this switch, I had a bug where a boss AI was skipping its second phase transition. Finding that with the hidden-state macros took me two hours; with the explicit member it took about ten minutes of watching the value change in the locals pane.


Quick Integration Checklist
The whole point of keeping this under 200 lines is that integration should take 15 minutes, not a sprint. I've added heavier systems to codebases and watched it consume an entire afternoon of CMakeLists.txt archaeology. This doesn't do that. Here's the exact sequence that works.



    Drop coroutine.hpp and coroutine_scheduler.hpp into your project. Seriously, that's it. No build system changes, no new static library to link, no add_subdirectory call. Since the entire implementation is header-only, you just #include "coroutine_scheduler.hpp" wherever you need it. The code only needs C++11, so -std=c++11 or any later standard on GCC/Clang is enough, and MSVC's default language mode already covers it. That's the only build requirement.



    Create your scheduler and wire it to your tick. One instance, owned wherever your game loop lives: a GameWorld class, your Engine singleton, doesn't matter. Call scheduler.tick() exactly once per frame, after input processing and before rendering. Not twice. Not inside a subsystem update. The scheduler assumes it drives coroutine resumption, and double-calling it mid-frame causes resume ordering bugs that are annoying to track down.

// In your game loop: after input, before render
scheduler.tick();




    Write your first coroutine struct. Inherit from Coroutine, implement update(), and use CR_BEGIN / CR_END to wrap your logic. The __LINE__-based state machine lives entirely inside that block. The thing that trips people up the first time: don't keep variables that need to persist across a yield point as stack locals, because they'll be garbage on re-entry. Promote them to struct members.

// NPC, Vec2, walk_to() and is_moving() are placeholders for your own types.
struct PatrolCoroutine : Coroutine {
    NPC* npc;
    const float& dt;          // frame delta owned by the game loop
    Vec2 waypoint_a, waypoint_b;
    float elapsed = 0.f;      // member, not local, so it survives yields

    PatrolCoroutine(NPC* n, const float& frame_dt, Vec2 a, Vec2 b)
        : npc(n), dt(frame_dt), waypoint_a(a), waypoint_b(b) {}

    bool update() override {
        CR_BEGIN;
        while (true) {
            npc->walk_to(waypoint_a);
            CR_WAIT(!npc->is_moving());
            elapsed = 0.f;
            CR_WAIT((elapsed += dt) >= 2.0f);  // wait 2 seconds
            npc->walk_to(waypoint_b);
            CR_WAIT(!npc->is_moving());
        }
        CR_END;
    }
};




    Spawn with scheduler.add() wherever you'd have initialized a state machine. That means replacing calls like npc.state = NPC_STATE_PATROL; npc.state_timer = 0; with a single add() call. The coroutine owns its own state now, so you don't need to store it on the entity unless you want a handle back for cancellation.

// Old way
enemy->current_state = ENEMY_STATE_SEARCH;
enemy->search_timer = 0.f;
enemy->last_known_pos = player_pos;

// New way
scheduler.add(new SearchCoroutine(enemy, player_pos));




    Set MAX_COROUTINES if you use the fixed-array scheduler. The std::vector version grows on demand, so this step only applies to the bare-metal variant. For a game with a handful of enemies and some cutscene logic, 64 is plenty; I shipped a mobile title under that limit without ever coming close. If you're scripting NPCs in an open-world area where 80+ characters might be active simultaneously, bump it to 256. The scheduler's memory cost is a fixed-size array of pointers, so it never allocates per spawn; the coroutine objects themselves still come from wherever you allocate them (new, a pool, or an arena). Profile your peak active coroutine count in a stress scenario and set the limit to roughly 1.5x that.

static constexpr size_t MAX_COROUTINES = 256;   // in the fixed-array scheduler


    Hitting the limit should never fail silently: have add() assert in debug builds and report failure to the caller, so you know immediately if you've undersized it.



One non-obvious thing: coroutines spawned during scheduler.tick() (say, an NPC death spawns a loot-drop coroutine) won't execute until the next tick, because add() parks them in _pending until the following flush. That's by design and keeps frame ordering deterministic, but it surprises people who expect immediate execution semantics. Just know it going in and you won't spend 30 minutes debugging a one-frame delay.





