Codebaker

Inside the Go Scheduler: How GMP Model Powers Millions of Goroutines

Introduction

A common question developers ask when learning Go is: "Why goroutines when threads already work?" Take Java, for example—each client request is executed on an OS thread. Simple, straightforward, and battle-tested. So why did Go introduce this additional abstraction?
The answer lies in scalability and efficiency. While OS threads are powerful, they're also heavyweight—creating thousands of them can overwhelm a system. Goroutines, on the other hand, are lightweight and managed by Go's runtime, allowing you to spawn millions without breaking a sweat. But this raises another question: how does Go efficiently map thousands of goroutines onto a limited number of OS threads?
This is where Go's ingenious GMP scheduling model comes into play.

The Challenge: Mapping Goroutines to Threads

OS threads are managed by the operating system, which knows only about threads, not goroutines. A goroutine must therefore be mapped onto a thread to execute. Go uses an M:N mapping: many goroutines are multiplexed over a smaller set of OS threads, with each thread running at most one goroutine at any given instant.
But how should this mapping occur? Let's explore two approaches and their problems:

Approach 1: A Single Global Queue

Idea: A single global queue where threads push and pull goroutines concurrently.
Problem: Lock contention on the global queue. Under high goroutine throughput, every thread is constantly fighting over the same queue, because every push or pull forces each thread to:

  1. Acquire a lock on the global queue
  2. Pull/Push a goroutine in the queue
  3. Release the lock
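To make the contention concrete, here is a toy sketch of Approach 1 in Go. The type and method names are illustrative, not the runtime's: every worker serializes on the same mutex for every push and pop.

```go
package main

import (
	"fmt"
	"sync"
)

// Toy model of a single global run queue guarded by one mutex.
type globalQueue struct {
	mu sync.Mutex
	gs []int // queued goroutine IDs
}

func (q *globalQueue) push(g int) {
	q.mu.Lock()            // 1. acquire a lock on the global queue
	q.gs = append(q.gs, g) // 2. push a goroutine into the queue
	q.mu.Unlock()          // 3. release the lock
}

func (q *globalQueue) pop() (int, bool) {
	q.mu.Lock() // same lock again, for every single pop
	defer q.mu.Unlock()
	if len(q.gs) == 0 {
		return 0, false
	}
	g := q.gs[0]
	q.gs = q.gs[1:]
	return g, true
}

func main() {
	q := &globalQueue{}
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ { // 4 "threads" all hammering one lock
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				q.push(id*1000 + i)
				q.pop()
			}
		}(w)
	}
	wg.Wait()
	fmt.Println("done: every operation serialized on one mutex")
}
```

The real runtime avoids exactly this pattern: under GMP, the common scheduling path touches only a P-local queue and is mostly lock-free.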

Approach 2: A Local Queue Per Thread

Idea: Give each thread its own local queue, eliminating contention on a shared structure.
Problem: Two issues arise. First, if a goroutine makes a blocking system call, the OS blocks that thread — all goroutines waiting behind it are now stuck, even though the CPU is free. Second, load becomes unbalanced: one thread's queue may hold 100 goroutines while another's is empty, and there is no rebalancing mechanism.

The Solution: The GMP Scheduling Model

The Go developers devised an elegant solution called the GMP scheduling model, which cleverly avoids these bottlenecks. The model consists of three key components:
The Three Components:

  1. G (Goroutine) - The lightweight thread of execution
  2. M (Machine) - An OS thread (the term "Machine" is used in Go's runtime)
  3. P (Processor) - Not a CPU, but a logical processor that acts as a middleman.

Important: The GMP model still has a global run queue, but it serves as a secondary queue rather than the primary one; Ps fall back to it when their local queues run dry.

What is a Processor (P)?
Instead of assigning queues directly to threads, Go uses distributed run queues owned by Ps. Each P maintains its own local run queue that holds multiple goroutines. Think of P as a scheduling context that bridges goroutines and threads.
Key relationships:

  • Each P maintains a local run queue of goroutines
  • Each P is attached to an M (OS thread)
  • P controls the parallelism in your program

Understanding GOMAXPROCS: Tuning the Engine's Parallelism

When are Goroutines and Threads Created?

  • Goroutines (G) are created as per your code instructions (e.g., go functionName())
  • Threads (M) are created by the scheduler when needed.

Here's the crucial insight: P controls parallelism. The number of Ps determines:

  • The number of local run queues
  • The maximum number of goroutines that can run in parallel
  • The number of threads (Ms) required

The GOMAXPROCS Setting
GOMAXPROCS determines the number of Ps in your program, and it can be manually configured.
Example scenario:

  • System: 2 CPU cores
  • Goroutines: 16 created
  • Setting: GOMAXPROCS = 4

What happens:

  • 4 Ps are created → 4 local run queues
  • Goroutines are distributed across queues (e.g., 4 goroutines per queue)
  • 4 goroutines can run in parallel
  • The Go runtime requests 4 Ms (threads) from the OS
  • Each P attaches to an M

The problem: With only 2 CPU cores but 4 threads, the OS must perform context switching between threads at the kernel level, which is relatively expensive.

Best practice: Set GOMAXPROCS = number of CPU cores (this has been the default since Go 1.5).

Thread Management: Creation, Parking, and Reuse

The scheduler doesn't always create new threads. Here's how Go optimizes thread management:
Scenario 1: Blocking System Call
When a goroutine makes a blocking system call:

  • The goroutine (G) blocks, and the OS thread (M) executing it blocks with it in the kernel
  • The P, with its local run queue of still-runnable goroutines, must not sit idle behind the blocked M
  • The runtime detaches the P from the blocked M — this is called a P Handoff
  • A new M is created and attached to the P so the remaining goroutines keep running
  • When the blocking call completes:
    • The unblocked G is placed back on a run queue (the runtime first tries to reacquire a P for it; if none is available, the G goes to the global queue)
    • The now-free M is parked (not destroyed) to save thread-creation overhead

Scenario 2: Subsequent Blocking Call
When another goroutine makes a blocking system call:

  • Again, the P needs to be handed off to a thread
  • This time, no new M is created
  • Instead, the parked M is reused, saving creation overhead

This parking and reusing strategy significantly reduces the overhead of thread management.

Scheduling Goroutines:

Scheduling loop:

When the M attached to a P needs its next goroutine, the runtime runs this lookup loop:

a. Every 61st goroutine — check global queue first — If a goroutine is found in the global queue, run it. If empty, proceed to b.
b. Check local queue — The P checks its own local queue. If a goroutine is found, run it. If empty, proceed to c.
c. Check global queue — Checked when local queue is empty (skipped if already checked in step a). If a goroutine is found, run it. If empty, proceed to d.
d. Work Stealing — Steals up to half the goroutines from another P. The runtime visits all Ps in a random order and stops when it finds a victim with stealable goroutines.
e. Check the network poller - The runtime checks whether any parked I/O-bound goroutine is ready to resume.

  • What is the network poller - When a goroutine performs network I/O (for example, reading from a connection returned by net.Dial() or accepting on a listener from net.Listen()), the runtime parks the goroutine and registers the underlying file descriptor with the netpoller. The M that was executing the goroutine is not blocked (unlike with blocking syscalls), and it picks up another runnable goroutine. When the file descriptor becomes ready, the netpoller unparks the goroutine by placing it back on a run queue.
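The lookup order above can be condensed into a toy sketch. The names (toyP, findRunnable) mirror the spirit of the runtime's findRunnable in runtime/proc.go but are entirely illustrative; the real code is far more involved.

```go
package main

import "fmt"

// toyP models a P with its local run queue of goroutine IDs.
type toyP struct {
	runq []int
	tick int
}

func (p *toyP) findRunnable(global *[]int, others []*toyP) (int, bool) {
	p.tick++
	// a. every 61st schedule, poll the global queue first (anti-starvation)
	if p.tick%61 == 0 && len(*global) > 0 {
		g := (*global)[0]
		*global = (*global)[1:]
		return g, true
	}
	// b. own local run queue
	if len(p.runq) > 0 {
		g := p.runq[0]
		p.runq = p.runq[1:]
		return g, true
	}
	// c. global queue (when the local queue is empty)
	if len(*global) > 0 {
		g := (*global)[0]
		*global = (*global)[1:]
		return g, true
	}
	// d. work stealing: take half of a victim P's queue
	for _, victim := range others {
		if n := len(victim.runq); n > 0 {
			half := (n + 1) / 2
			stolen := victim.runq[:half]
			victim.runq = victim.runq[half:]
			p.runq = append(p.runq, stolen[1:]...)
			return stolen[0], true
		}
	}
	// e. a netpoller check would go here; omitted in this sketch
	return 0, false
}

func main() {
	global := []int{}
	p1 := &toyP{runq: []int{1, 2, 3, 4}}
	p2 := &toyP{} // idle P with an empty local queue
	g, _ := p2.findRunnable(&global, []*toyP{p1})
	// prints: p2 ran G 1 and kept 1 stolen goroutine(s) queued
	fmt.Printf("p2 ran G %d and kept %d stolen goroutine(s) queued\n", g, len(p2.runq))
}
```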

Cooperative Scheduling:

Go was primarily designed for backend systems that rely heavily on channels, function calls, and I/O. These naturally act as yield points where the scheduler can switch goroutines:

  • Channel send/receive
  • System calls
  • Function calls (the compiler inserts checks at function prologues where the scheduler can step in)

This is Cooperative Scheduling, happening entirely within the Go runtime.
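With a single P, this cooperative interleaving is easy to see: every channel operation below is a yield point where the scheduler hands control between the two goroutines (a toy sketch, not runtime internals).

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	runtime.GOMAXPROCS(1) // one P: goroutines can only interleave at yield points

	ping := make(chan int)
	pong := make(chan int)

	go func() {
		for i := 0; i < 3; i++ {
			v := <-ping // receive: a natural yield point
			pong <- v + 1
		}
	}()

	for i := 0; i < 3; i++ {
		ping <- i // send: the scheduler switches to the other goroutine here
		fmt.Println("got", <-pong)
	}
	// prints: got 1, got 2, got 3
}
```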

The Problem: CPU-Bound Goroutines
A goroutine with no yield points — such as a tight infinite loop — will never cooperate:

func monopoly() {
    x := 0
    for {
        x++ // no function calls, no channel operations — never yields
    }
}

Before Go 1.14, this goroutine would monopolize its P indefinitely, starving every other goroutine in the same local queue.

Preemptive Scheduling (Go 1.14+)

Go 1.14 introduced signal-based preemption as a fallback for CPU-bound goroutines.
The mechanism is driven by sysmon — a background thread that runs without a P, continuously monitoring the scheduler. When sysmon detects a goroutine has been running for approximately 10ms without yielding:

  1. Sysmon sends SIGURG to the M running that goroutine
  2. The Go runtime's signal handler fires and hijacks execution
  3. The goroutine is paused, marked runnable, and placed back in its local queue
  4. The M proceeds to the next goroutine

This signal-based preemption is a fallback: it only kicks in for goroutines that never reach a natural yield point.

Key Takeaways

Distributed Scheduling: Per-P local queues eliminate global lock contention, allowing threads to pick work independently.

Thread Efficiency: Threads (Ms) are parked and reused rather than destroyed, significantly reducing creation overhead.

P Handoff: During blocking syscalls, the P detaches from the blocked M and attaches to a new or parked M to keep other goroutines moving.

Work Stealing: Idle Ps automatically balance the load by stealing half the tasks from a randomly selected P.

Starvation Prevention: The 61-tick rule ensures the global queue is periodically prioritized so no goroutine is left behind.

Hybrid Scheduling: Combines Cooperative yielding at natural code points (I/O, channels) with Signal-based preemption (via sysmon and SIGURG) for long-running CPU tasks.

This design allows Go programs to efficiently manage millions of goroutines with only a handful of OS threads, giving you the simplicity of synchronous code with the performance of asynchronous systems.
