Codebaker

Inside the Go Scheduler: How GMP Model Powers Millions of Goroutines

Introduction

A common question developers ask when learning Go is: "Why goroutines when threads already work?" Take Java, for example—each client request is executed on an OS thread. Simple, straightforward, and battle-tested. So why did Go introduce this additional abstraction?
The answer lies in scalability and efficiency. While OS threads are powerful, they're also heavyweight—creating thousands of them can overwhelm a system. Goroutines, on the other hand, are lightweight and managed by Go's runtime, allowing you to spawn millions without breaking a sweat. But this raises another question: how does Go efficiently map thousands of goroutines onto a limited number of OS threads?
This is where Go's ingenious GMP scheduling model comes into play.

The Challenge: Mapping Goroutines to Threads

OS threads are managed by the operating system, which knows only about threads, not goroutines. A goroutine must therefore be mapped onto a thread to execute. Go uses an M:N mapping: many goroutines are multiplexed over a smaller set of OS threads, with each thread running at most one goroutine at any given instant.
But how should this mapping occur? Let's explore two approaches and their problems:

Approach 1: A Single Global Queue

Idea: A single global queue where threads push and pull goroutines concurrently.
Problem: Lock contention on the global queue. Under high goroutine throughput, every thread is constantly fighting over the same queue, because every push or pull forces each thread to:

  1. Acquire a lock on the global queue
  2. Pull/Push a goroutine in the queue
  3. Release the lock
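To make the contention concrete, here is a toy sketch of Approach 1 in Go. The type and method names are illustrative, not the runtime's: every worker serializes on the same mutex for every push and pop.

```go
package main

import (
	"fmt"
	"sync"
)

// Toy model of a single global run queue guarded by one mutex.
type globalQueue struct {
	mu sync.Mutex
	gs []int // queued goroutine IDs
}

func (q *globalQueue) push(g int) {
	q.mu.Lock()            // 1. acquire a lock on the global queue
	q.gs = append(q.gs, g) // 2. push a goroutine into the queue
	q.mu.Unlock()          // 3. release the lock
}

func (q *globalQueue) pop() (int, bool) {
	q.mu.Lock() // same lock again, for every single pop
	defer q.mu.Unlock()
	if len(q.gs) == 0 {
		return 0, false
	}
	g := q.gs[0]
	q.gs = q.gs[1:]
	return g, true
}

func main() {
	q := &globalQueue{}
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // 4 "threads" all hammering one lock
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				q.push(id*1000 + i)
				q.pop()
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("done: every operation serialized on one mutex")
}
```

The real runtime avoids exactly this pattern: under GMP, the common scheduling path touches only a P-local queue and is mostly lock-free.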

Approach 2: A Local Queue Per Thread

Idea: Give each thread its own local queue, eliminating contention on a shared structure.
Problem: Two issues arise. First, if a goroutine makes a blocking system call, the OS blocks that thread — all goroutines waiting behind it are now stuck, even though the CPU is free. Second, load becomes unbalanced: one thread's queue may hold 100 goroutines while another's is empty, and there is no rebalancing mechanism.

The Solution: The GMP Scheduling Model

The Go developers devised an elegant solution called the GMP scheduling model, which cleverly avoids these bottlenecks. The model consists of three key components:
The Three Components:

  1. G (Goroutine) - The lightweight thread of execution
  2. M (Machine) - An OS thread (the term "Machine" is used in Go's runtime)
  3. P (Processor) - Not a CPU, but a logical processor that acts as a middleman.

Important: The GMP model still has a global run queue, but it serves as a secondary queue rather than the primary one; Ps fall back to it when their local queues run dry.

What is a Processor (P)?
Instead of assigning queues directly to threads, Go uses distributed run queues owned by Ps. Each P maintains its own local run queue that holds multiple goroutines. Think of P as a scheduling context that bridges goroutines and threads.
Key relationships:

  • Each P maintains a local run queue of goroutines
  • Each P is attached to an M (OS thread)
  • P controls the parallelism in your program

Understanding GOMAXPROCS: Tuning the Engine's Parallelism

When are Goroutines and Threads Created?

  • Goroutines (G) are created as per your code instructions (e.g., go functionName())
  • Threads (M) are created by the scheduler when needed.

Here's the crucial insight: P controls parallelism. The number of Ps determines:

  • The number of local run queues
  • The maximum number of goroutines that can run in parallel
  • The number of threads (Ms) required

The GOMAXPROCS Setting
GOMAXPROCS determines the number of Ps in your program, and it can be manually configured.
Example scenario:

  • System: 2 CPU cores
  • Goroutines: 16 created
  • Setting: GOMAXPROCS = 4

What happens:

  • 4 Ps are created → 4 local run queues
  • Goroutines are distributed across queues (e.g., 4 goroutines per queue)
  • 4 goroutines can run in parallel
  • The Go runtime requests 4 Ms (threads) from the OS
  • Each P attaches to an M

The problem: With only 2 CPU cores but 4 threads, the OS must perform context switching between threads at the kernel level, which is relatively expensive.

Best practice: Set GOMAXPROCS = number of CPU cores (this has been the default since Go 1.5).

Thread Management: Creation, Parking, and Reuse

The scheduler doesn't always create new threads. Here's how Go optimizes thread management:
Scenario 1: Blocking System Call
When a goroutine makes a blocking system call:

  • The goroutine (G) blocks, and the OS thread (M) executing it blocks with it in the kernel
  • The P, with its local run queue of still-runnable goroutines, must not sit idle behind the blocked M
  • The runtime detaches the P from the blocked M — this is called a P Handoff
  • A new M is created and attached to the P so the remaining goroutines keep running
  • When the blocking call completes:
    • The unblocked G is placed back on a run queue (the runtime first tries to reacquire a P for it; if none is available, the G goes to the global queue)
    • The now-free M is parked (not destroyed) to save thread-creation overhead

Scenario 2: Subsequent Blocking Call
When another goroutine makes a blocking system call:

  • Again, the P needs to be handed off to a thread
  • This time, no new M is created
  • Instead, the parked M is reused, saving creation overhead

This parking and reusing strategy significantly reduces the overhead of thread management.

Scheduling Goroutines:

Scheduling loop:

When the M attached to a P needs its next goroutine, the runtime runs this lookup loop:

a. Every 61st goroutine — check global queue first — If a goroutine is found in the global queue, run it. If empty, proceed to b.
b. Check local queue — The P checks its own local queue. If a goroutine is found, run it. If empty, proceed to c.
c. Check global queue — Checked when local queue is empty (skipped if already checked in step a). If a goroutine is found, run it. If empty, proceed to d.
d. Work Stealing — Steals up to half the goroutines from another P. The runtime visits all Ps in a random order and stops when it finds a victim with stealable goroutines.
e. Check the network poller - The runtime checks whether any parked I/O-bound goroutine is ready to resume.

  • What is the network poller - When a goroutine performs network I/O (for example, reading from a connection returned by net.Dial() or accepting on a listener from net.Listen()), the runtime parks the goroutine and registers the underlying file descriptor with the netpoller. The M that was executing the goroutine is not blocked (unlike with blocking syscalls), and it picks up another runnable goroutine. When the file descriptor becomes ready, the netpoller unparks the goroutine by placing it back on a run queue.
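The lookup order above can be condensed into a toy sketch. The names (toyP, findRunnable) mirror the spirit of the runtime's findRunnable in runtime/proc.go but are entirely illustrative; the real code is far more involved.

```go
package main

import "fmt"

// toyP models a P with its local run queue of goroutine IDs.
type toyP struct {
	runq []int
	tick int
}

func (p *toyP) findRunnable(global *[]int, others []*toyP) (int, bool) {
	p.tick++
	// a. every 61st schedule, poll the global queue first (anti-starvation)
	if p.tick%61 == 0 && len(*global) > 0 {
		g := (*global)[0]
		*global = (*global)[1:]
		return g, true
	}
	// b. own local run queue
	if len(p.runq) > 0 {
		g := p.runq[0]
		p.runq = p.runq[1:]
		return g, true
	}
	// c. global queue (when the local queue is empty)
	if len(*global) > 0 {
		g := (*global)[0]
		*global = (*global)[1:]
		return g, true
	}
	// d. work stealing: take half of a victim P's queue
	for _, victim := range others {
		if n := len(victim.runq); n > 0 {
			half := (n + 1) / 2
			stolen := victim.runq[:half]
			victim.runq = victim.runq[half:]
			p.runq = append(p.runq, stolen[1:]...)
			return stolen[0], true
		}
	}
	// e. a netpoller check would go here; omitted in this sketch
	return 0, false
}

func main() {
	global := []int{}
	p1 := &toyP{runq: []int{1, 2, 3, 4}}
	p2 := &toyP{} // idle P with an empty local queue
	g, _ := p2.findRunnable(&global, []*toyP{p1})
	// prints: p2 ran G 1 and kept 1 stolen goroutine(s) queued
	fmt.Printf("p2 ran G %d and kept %d stolen goroutine(s) queued\n", g, len(p2.runq))
}
```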

Cooperative Scheduling:

Go was primarily designed for backend systems that rely heavily on channels, function calls, and I/O. These naturally act as yield points where the scheduler can switch goroutines:

  • Channel send/receive
  • System calls
  • Function calls (the compiler inserts checks at function prologues where the scheduler can step in)

This is Cooperative Scheduling, happening entirely within the Go runtime.
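With a single P, this cooperative interleaving is easy to see: every channel operation below is a yield point where the scheduler hands control between the two goroutines (a toy sketch, not runtime internals).

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // one P: goroutines can only interleave at yield points

	ping := make(chan int)
	pong := make(chan int)

	go func() {
		for i := 0; i < 3; i++ {
			v := <-ping // receive: a natural yield point
			pong <- v + 1
		}
	}()

	for i := 0; i < 3; i++ {
		ping <- i // send: the scheduler switches to the other goroutine here
		fmt.Println("got", <-pong)
	}
	// prints: got 1, got 2, got 3
}
```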

The Problem: CPU-Bound Goroutines
A goroutine with no yield points — such as a tight infinite loop — will never cooperate:

func monopoly() {
    x := 0
    for {
        x++ // no function calls, no channel operations — never yields
    }
}

Before Go 1.14, this goroutine would monopolize its P indefinitely, starving every other goroutine in the same local queue.

Preemptive Scheduling (Go 1.14+)

Go 1.14 introduced signal-based preemption as a fallback for CPU-bound goroutines.
The mechanism is driven by sysmon — a background thread that runs without a P, continuously monitoring the scheduler. When sysmon detects a goroutine has been running for approximately 10ms without yielding:

  1. Sysmon sends SIGURG to the M running that goroutine
  2. The Go runtime's signal handler fires and hijacks execution
  3. The goroutine is paused, marked runnable, and placed back in its local queue
  4. The M proceeds to the next goroutine

This signal-based preemption is a fallback: it only kicks in for goroutines that never reach a natural yield point.

Key Takeaways

Distributed Scheduling: Per-P local queues eliminate global lock contention, allowing threads to pick work independently.

Thread Efficiency: Threads (Ms) are parked and reused rather than destroyed, significantly reducing creation overhead.

P Handoff: During blocking syscalls, the P detaches from the blocked M and attaches to a new or parked M to keep other goroutines moving.

Work Stealing: Idle Ps automatically balance the load by stealing half the tasks from a randomly selected P.

Starvation Prevention: The 61-tick rule ensures the global queue is periodically prioritized so no goroutine is left behind.

Hybrid Scheduling: Combines Cooperative yielding at natural code points (I/O, channels) with Signal-based preemption (via sysmon and SIGURG) for long-running CPU tasks.

This design allows Go programs to efficiently manage millions of goroutines with only a handful of OS threads, giving you the simplicity of synchronous code with the performance of asynchronous systems.
