Introduction
A common question developers ask when learning Go is: "Why goroutines when threads already work?" Take Java, for example—each client request is executed on an OS thread. Simple, straightforward, and battle-tested. So why did Go introduce this additional abstraction?
The answer lies in scalability and efficiency. While OS threads are powerful, they're also heavyweight—creating thousands of them can overwhelm a system. Goroutines, on the other hand, are lightweight and managed by Go's runtime, allowing you to spawn millions without breaking a sweat. But this raises another question: how does Go efficiently map thousands of goroutines onto a limited number of OS threads?
This is where Go's ingenious GMP scheduling model comes into play.
The Challenge: Mapping Goroutines to Threads
OS threads are maintained by the operating system, which means the OS only knows about threads, not goroutines. Therefore, a goroutine must be mapped onto a thread to execute. Because there are far more goroutines than threads, this is an M:N mapping: M goroutines multiplexed over N OS threads. At any given time, one thread runs one goroutine.
But how should this mapping occur? Let's explore two approaches and their problems:
Approach 1: A Single Global Queue
Idea: A single global queue where threads push and pull goroutines concurrently.
Problem: This creates lock contention on the global queue. Under high goroutine throughput, every thread is constantly fighting over the same queue, because each thread has to:
- Acquire a lock on the global queue
- Pull/Push a goroutine in the queue
- Release the lock
Approach 2: A Local Queue Per Thread
Idea: Give each thread its own local queue, eliminating contention on a shared structure.
Problem: Two issues arise. First, if a goroutine makes a blocking system call, the OS blocks that thread — all goroutines waiting behind it are now stuck, even though the CPU is free. Second, load becomes unbalanced: one thread's queue may hold 100 goroutines while another's is empty, and there is no rebalancing mechanism.
The Solution: The GMP Scheduling Model
The Go developers devised an elegant solution called the GMP scheduling model, which cleverly avoids these bottlenecks. The model consists of three key components:
The Three Components:
- G (Goroutine) - The lightweight thread of execution
- M (Machine) - An OS thread (the term "Machine" is used in Go's runtime)
- P (Processor) - Not a CPU, but a logical processor that acts as a middleman.
Important: There is still a global run queue in the GMP model, but it's not the primary queue. It is used as a secondary queue.
What is a Processor (P)?
Instead of assigning queues directly to threads, Go uses distributed run queues owned by Ps. Each P maintains its own local run queue that holds multiple goroutines. Think of P as a scheduling context that bridges goroutines and threads.
Key relationships:
- Each P maintains a local run queue of goroutines
- Each P is attached to an M (OS thread)
- P controls the parallelism in your program
Understanding GOMAXPROCS: Tuning the Engine's Parallelism
When are Goroutines and Threads Created?
- Goroutines (G) are created as per your code instructions (e.g., go functionName())
- Threads (M) are created by the scheduler when needed.
Here's the crucial insight: P controls parallelism. The number of Ps determines:
- The number of local run queues
- The maximum number of goroutines that can run in parallel
- The number of threads (Ms) required
The GOMAXPROCS Setting
GOMAXPROCS determines the number of Ps in your program, and it can be manually configured.
Example scenario:
- System: 2 CPU cores
- Goroutines: 16 created
- Setting: GOMAXPROCS = 4
What happens:
- 4 Ps are created → 4 local run queues
- Goroutines are distributed across queues (e.g., 4 goroutines per queue)
- Up to 4 goroutines can run in parallel
- The Go runtime requests 4 Ms (threads) from the OS
- Each P attaches to an M
The problem: With only 2 CPU cores but 4 threads, the OS must perform context switching between threads at the kernel level, which is relatively expensive.
Best practice: Set GOMAXPROCS = number of CPU cores (this is also the default in modern Go).
Thread Management: Creation, Parking, and Reuse
The scheduler doesn't always create new threads. Here's how Go optimizes thread management:
Scenario 1: Blocking System Call
When a goroutine makes a blocking system call:
- The goroutine (G) is blocked
- The OS marks the thread (M) executing it as blocked
- The P (with its local run queue) still needs a thread to keep running its other goroutines
- The runtime detaches the P from the blocked M — this is called a P Handoff
- A new M is created and attached to P to continue running other goroutines.
- When the blocking call completes:
- The M tries to reacquire a P so the unblocked G can keep running; if no P is free, the G is placed on the global run queue
- The M is then parked (not destroyed) to save thread-creation overhead
Scenario 2: Subsequent Blocking Call
When another goroutine makes a blocking system call:
- Again, the P needs to be handed off to a thread
- This time, no new M is created
- Instead, the parked M is reused, saving creation overhead
This parking and reusing strategy significantly reduces the overhead of thread management.
Scheduling Goroutines:
Scheduling loop:
When P needs to assign a G to the attached M, runtime runs this loop:
a. Every 61st goroutine — check global queue first — If a goroutine is found in the global queue, run it. If empty, proceed to b.
b. Check local queue — The P checks its own local queue. If a goroutine is found, run it. If empty, proceed to c.
c. Check global queue — Checked when local queue is empty (skipped if already checked in step a). If a goroutine is found, run it. If empty, proceed to d.
d. Work Stealing — Steals up to half the goroutines from another P. The runtime visits all Ps in a random order and stops when it finds a victim with stealable goroutines.
e. Check the network poller - The runtime checks whether any I/O-bound goroutine is ready to resume.
- What is the network poller? When a goroutine performs non-blocking I/O, the runtime parks the goroutine and registers its file descriptor with the netpoller. The M that was executing the goroutine is not blocked (unlike with blocking syscalls), so it picks up another runnable goroutine. When the file descriptor becomes ready, the netpoller unparks the goroutine into a run queue. Network operations such as net.Dial() and net.Listen() use non-blocking descriptors under the hood.
Cooperative Scheduling:
Go was primarily designed for backend systems that rely heavily on channels, function calls, and I/O. These naturally act as yield points where the scheduler can switch goroutines:
- Channel send/receive
- System calls
- Function calls
This is Cooperative Scheduling, happening entirely within the Go runtime.
The Problem: CPU-Bound Goroutines
A goroutine with no yield points — such as a tight infinite loop — will never cooperate:
func monopoly() {
    x := 0
    for {
        x++ // no function calls, no channel operations, never yields
    }
}
Before Go 1.14, this goroutine would monopolize its P indefinitely, starving every other goroutine in the same local queue.
Preemptive Scheduling (Go 1.14+)
Go 1.14 introduced signal-based preemption as a fallback for CPU-bound goroutines.
The mechanism is driven by sysmon — a background thread that runs without a P, continuously monitoring the scheduler. When sysmon detects a goroutine has been running for approximately 10ms without yielding:
- Sysmon sends SIGURG to the M running that goroutine
- The Go runtime's signal handler fires and hijacks execution
- The goroutine is paused, marked runnable, and placed back in its local queue
- The M proceeds to the next goroutine
Signal-based preemption only kicks in for goroutines that never reach a natural yield point.
Key Takeaways
Distributed Scheduling: Per-P local queues eliminate global lock contention, allowing threads to pick work independently.
Thread Efficiency: Threads (Ms) are parked and reused rather than destroyed, significantly reducing creation overhead.
P Handoff: During blocking syscalls, the P detaches from the blocked M and attaches to a new or parked M to keep other goroutines moving.
Work Stealing: Idle Ps automatically balance the load by stealing half the tasks from a randomly selected P.
Starvation Prevention: The 61-tick rule ensures the global queue is periodically prioritized so no goroutine is left behind.
Hybrid Scheduling: Combines Cooperative yielding at natural code points (I/O, channels) with Signal-based preemption (via sysmon and SIGURG) for long-running CPU tasks.
This design allows Go programs to efficiently manage millions of goroutines with only a handful of OS threads, giving you the simplicity of synchronous code with the performance of asynchronous systems.
