Silver_dev

Golang: The M:P:G Model

The M:P:G model is the core of Go's runtime scheduler, and one of the most efficient concurrency designs among modern programming languages.

It provides:

  • Efficient CPU core utilization through a fixed number of P's (Processors).
  • Scalability to hundreds of thousands of goroutines via lightweight G's (Goroutines).
  • Transparent handling of blocking operations through dynamic management of M's (OS Threads).
  • Intelligent load balancing using work stealing and spinning threads.

Go abstracts you away from low-level OS thread management, but not from concurrent programming itself. Your task is to design the system by defining the correct concurrency boundaries and synchronization points, while the M:P:G model takes care of efficient execution.

G (Goroutine) — Lightweight Execution Unit

A goroutine is not an operating system thread; it's a runtime-level abstraction representing a lightweight thread of control with minimal overhead.

  • Initial Stack Size: 2 KB (roughly 1000 times smaller than an OS thread).
  • Context Switch Overhead: ~200 ns (5-10 times faster than switching OS threads).
  • Placement: A goroutine is placed into the local run queue of the logical processor (P) designated to execute it.

What is G from the Runtime's Perspective?
It's a data structure that describes the execution state of a single function. It is not rigidly tied to any specific OS thread (M).

Key Fields of the g Struct:

  • stack: Describes the goroutine's stack memory.

    • lo and hi: The lower and upper bounds of the stack in memory.
    • guard: Used for stack overflow detection.
    • Important: A goroutine's stack starts small (e.g., 2 KB) and grows dynamically (by being copied to a new memory location) as needed.
  • sched: This field is far more critical than it might initially seem. It's a gobuf structure that saves the execution context (CPU register state) when the goroutine is not running. When a goroutine is paused (parked), it saves its sp (stack pointer), pc (program counter), bp (base pointer), and other registers here. When execution resumes, it restores them from here. This is the essence of "lightweight" switching — switching between G's involves saving and restoring a few dozen bytes in memory, not making an expensive kernel system call.

  • atomicstatus: The current state of the goroutine.

    • _Gidle: Just allocated, but hasn't run yet.
    • _Grunnable: Ready to run, currently in a run queue.
    • _Grunning: Currently executing on an M and P.
    • _Gsyscall: Executing, but inside a system call.
    • _Gwaiting: Blocked (waiting on a channel, mutex, or network I/O).
    • _Gdead: Not in use (held in a pool for reuse to avoid allocations).
    • _Gcopystack: Temporary state used while resizing (copying) the stack.
  • m: A pointer to the M struct (OS thread) if the goroutine is running (_Grunning) or was interrupted during a system call (_Gsyscall). Otherwise, it's nil.

  • waitreason: If the status is _Gwaiting, this field holds a string explaining the reason for blocking (e.g., "chan receive").

A goroutine knows nothing about CPU cores or OS threads. Its world consists of its stack, its saved registers, and the queue it resides in. It essentially says: "I have code that needs to be executed, and a context (registers/stack) from which to start or continue."

P (Processor) — The Conductor

P is an abstraction of a logical processor, the connecting link between M (OS thread) and G (goroutine). This is the most important and uniquely Go concept. P is not a physical CPU core, but a logical resource required to execute Go code.

What is P from the Runtime's Perspective?
It is a scheduling and resource allocation context. Without a P, an M has an OS thread but no "permission" to execute goroutine code (except in special cases, like the g0 system stack).

Key Fields of the p Struct:

  • runq: The Local Run Queue. This is a fixed-size circular buffer (typically 256 elements). Goroutines with the status _Grunnable are placed here.
  • runnext: A pointer to the next goroutine to be executed. This is an optimization, particularly for channel operations. When a goroutine becomes unblocked (e.g., after a channel send), it is often placed here to be executed immediately as the next step on the same P. This gives priority to recently unblocked goroutines.
  • mcache: The local memory allocation cache. Memory allocation in Go is very fast because each P has its own cache (mcache) for small objects. This allows goroutines on different Ps to allocate memory without global locks.
  • status: The state of the P.

    • _Pidle: Not currently in use.
    • _Prunning: Bound to an M and executing code.
    • _Psyscall: Bound to an M, but that M is currently inside a system call.
    • _Pgcstop: Stopped for garbage collection.
    • _Pdead: No longer in use.
  • schedtick, syscalltick: Counters for statistics and for detecting "stuck" goroutines.

P is the main arbiter of resources. The number of P's determines the degree of parallelism.

  • If you have 4 P's, then a maximum of 4 goroutines can be executing simultaneously (in parallel) at any given moment.
  • P decouples M and G. Thanks to P, an M doesn't need to know which goroutines are running on other Ms. Each P only concerns itself with its own queue.
  • When garbage collection (GC) starts, all P's are stopped (_Pgcstop), guaranteeing that no goroutine is mutating memory during the "stop the world" (STW) phase.

Key Characteristics

  • Number of P's:

    • Determined by GOMAXPROCS (defaults to the number of CPU cores).
    • Set once at program startup, but can be changed at runtime via runtime.GOMAXPROCS(n).
    • Important: This is NOT the maximum number of concurrent operations, but the number of execution contexts.
  • Local Run Queue:

    • Size: 256 elements (circular buffer).
    • Lock-free access (no mutexes) for the currently bound M.
    • Uses the "work stealing" algorithm for load balancing.
  • At Program Startup:

    • GOMAXPROCS instances of P are created.
    • Each P receives its own local run queue for goroutines.
    • When a goroutine is created, it is placed into the queue of the current P.
```
// Distribution example
P0: [G1, G2, G3, G4]
P1: [G5, G6]
P2: []
P3: [G7, G8, G9]
```

Work Stealing in Action

  • When P0 finishes executing goroutines and its local queue is empty:
  • It first checks the global run queue.
  • If the global queue is empty, it "steals" approximately half of the goroutines from another P's queue (e.g., P3).
  • Algorithm: Roughly half of the victim's queued goroutines are taken in a single batch, which amortizes the cost of stealing and keeps the load balanced.

Data Locality: The local queue reduces contention (lock contention) and improves cache efficiency.
Load Balancing: Work stealing ensures an even distribution of workload across available P's.
Potential Pitfall: Setting GOMAXPROCS too high (significantly exceeding the number of physical CPU cores) can increase overhead due to more frequent stealing attempts and context switching.

M (Machine) — The Executor

If G represents the "work," then M represents the "worker." It is a direct representation of an operating system thread. M is the OS thread that actually executes machine code. It serves as the crucial link between the OS and the Go runtime.

What is M from the Runtime's Perspective?
It's a structure that manages a single OS thread. M is the only entity that actually executes anything.

Key Fields of the m Struct:

  • g0: A pointer to a special goroutine called g0. This is the most important concept for understanding M's operation.

    • Every M has its own dedicated g0 goroutine.
    • Why is it needed? The code running in a regular goroutine (G) is "user" code. However, the code for the scheduler, the garbage collector, and signal handling is "system" (runtime) code. When M needs to execute scheduler code (e.g., to choose the next G to run), it switches to its g0 stack. This is a separate stack allocated specifically for runtime tasks.
    • This separation is critical for safety: user code cannot accidentally or maliciously corrupt the scheduler's stack, and vice versa. Switching between a user G and g0 is done via the runtime·mcall or runtime·systemstack mechanisms.
  • curg: A pointer to the current user goroutine being executed. If M is executing on its g0 stack, curg might be nil or point to the goroutine that was preempted.

  • p: A pointer to the Processor (P) to which this M is currently attached. If M is idle or inside a system call without a P, this field can be nil.

  • nextp, oldp: Used for temporarily holding a P during transitions (e.g., while entering or exiting a system call).

  • spinning: A flag indicating that this M is actively seeking work even though its associated P's queue is empty. This is an important optimization for load balancing.

  • park: A structure used to park (put to sleep) the M itself when there is absolutely no work to do.

M lives in two worlds simultaneously. It executes both user code (on the curg stack) and runtime code (on the g0 stack). The switching between these modes happens very frequently and must be extremely fast. This is where the magic of "lightweight" goroutines occurs: M switches between different G's, saving the context of one and loading the context of another, all while remaining the same underlying OS thread.

Key Characteristics

  • M-to-P Relationship:

    • At any given moment, an M can be attached to at most one P.
    • The number of M's can exceed the number of P's.
    • Maximum number of M's: ~10,000 (limited by maxmcount).
  • M States:

    • Spinning: Actively searching for work (high priority).
    • Idle: Idle, but can be activated if work appears.
    • Blocked: Blocked in the OS (e.g., during a system call).
  • Special M's:

    • M0: The first OS thread created and run when the program starts.
    • G0: A special goroutine for running system code on each M (one per M).
    • Sysmon: The system monitor (a dedicated M for background tasks like network polling, preemption, and GC triggering).

How It Looks in Action

Scenario 1: Normal Execution

  • M1 is attached to and executing on P1.
  • P1 processes goroutines from its local run queue.
  • M1 executes the goroutine's code on its curg stack.

Scenario 2: Blocking System Call

  • A running goroutine G2 makes a blocking system call and transitions to the _Gsyscall state.
  • M1 (which was running G2) detaches itself from its P (P1). P1 is now free (_Pidle).
  • Another M (e.g., M2) can now acquire the now-idle P1 and continue executing other goroutines from P1's local queue, preventing the CPU core from sitting idle.

Scenario 3: Recovery After a System Call

  • When the blocking system call (e.g., a network request) completes for G2:
  • M1 (which was blocked with G2) attempts to re-acquire its original P, P1.
  • If P1 is now busy (being used by M2), M1 cannot get it back immediately. It places G2 into the global run queue.
  • The sysmon thread or another spinning M will eventually notice G2 in the global queue and wake up or direct an idle M to handle G2.

The "Spinning Threads" Mechanism

The Go runtime uses heuristics to maintain "hot" (spinning) threads to minimize latency.

  • Spinning M: OS threads that are actively searching for work (checking local/global queues, attempting work stealing) even though they aren't currently executing a goroutine.
  • Goal: To minimize the delay (latency) when new tasks become available. A spinning M can pick up a new task immediately without waiting for an idle thread to wake up.

  • Algorithm:

    • If there are idle P's (available CPU capacity) and no spinning M's currently looking for work, the runtime may create a new spinning M.
    • A spinning M searches for work for a short period (approximately 10 microseconds).
    • If it doesn't find any work within that time, it eventually transitions to an idle state and is parked to save power/CPU.
  • Efficiency: Spinning M's significantly reduce latency when new tasks appear.

  • Overhead: Too many spinning M's would waste CPU cycles.

  • Limitation: The number of spinning M's is automatically and dynamically regulated by the runtime based on the current load and available work.

Key Interactions

  • Goroutine Creation (go func()):

    • The new goroutine (G) is placed into the local run queue of the current P.
    • If the local queue is full, part of its contents (along with the new G) is moved to the global run queue to make space.
  • Goroutine Execution (Scheduling Cycle):

    • An M must acquire (or already be attached to) a P.
    • It retrieves a runnable goroutine (G) from the P's local run queue (checking runnext first, then the circular buffer runq).
    • If the local queue is empty, it attempts work stealing from other P's or checks the global queue.
    • The M executes the goroutine's code on its curg stack. Execution continues until the goroutine blocks, makes a system call, is preempted (due to running too long), or voluntarily yields.

Overhead Metrics

| Operation | Time (Go 1.20) | Notes |
| --- | --- | --- |
| Goroutine creation | ~200 ns | Including stack allocation |
| Goroutine context switch | ~200 ns | Without a system call |
| Blocking system call | ~500 ns | Including parking the goroutine |
| Network operation (non-blocking) | ~100 ns | Handled via the netpoller |

M:P:G Interaction — A Simplified Analogy

When you write go func() { ... }(), you are creating a G.

  1. The runtime places this G into the queue of some P (or the global queue, if necessary).
  2. The M (OS thread) attached to that P is awakened (if idle) and executes the code of G.
  3. If G makes a blocking system call, the M detaches from its P. The P, now free, can find (or be picked up by) another M to continue executing other G's from its queue.
  4. If G simply runs a CPU-bound loop for ~10ms, the scheduler preempts it and hands control over to another G waiting in the P's local queue.

To solidify this concept, imagine a factory representing the Go process:

G (Goroutine) — is a task or job (like a blueprint and materials on a small, dedicated work table). There can be thousands of these tasks.

M (Machine) — is a worker. Each worker has their own personal toolbox (the g0 stack) for performing internal, system-level tasks needed to manage the jobs.

P (Processor) — is a workbench. On the workbench, there is a list (local queue) of jobs waiting to be done, and a special spot (runnext) for the most urgent next job. The worker (M) comes to the workbench (P) to do the actual work (G).

The Workflow Process:

For a worker (M) to start working on a task (G), they must approach a workbench (P) and take a task from it.

  1. The worker places the task's blueprint onto the workbench (curg) and begins working.

  2. If the task needs to wait (for example, waiting for materials to arrive, analogous to blocking on a channel), the worker sets the blueprint aside (the task's state is saved in the goroutine's own sched field) and takes the next task from the workbench.

  3. If the worker needs to step away for a very important reason (a system call), they leave the workbench (detach from the P) so that another worker can use it.

  4. If the workbench runs out of tasks, the worker can approach a neighboring workbench and steal half of its tasks (work stealing).

Concurrency ≠ Parallelism

This is perhaps the most important statement by Rob Pike to understand.

  • Concurrency: This is about the composition of independently executing tasks. It's about structuring a program. Imagine you are cooking dinner: you put water on for pasta, you chop tomatoes for a salad, and you keep an eye on the patties in the pan. You are one person, but you are managing multiple tasks by switching between them. You are concurrent.

  • Parallelism: This is about the simultaneous execution of tasks. For this, you need multiple cooks (CPU cores). One cook boils the pasta, a second chops the salad, and a third fries the patties. They are working in parallel.

Go is built for concurrency first and foremost. It makes it easy to write programs structured as a set of interacting tasks. Whether this code actually executes in parallel depends on the hardware (the number of CPU cores) and the runtime settings (like GOMAXPROCS).

| Characteristic | OS Threads | Go Goroutines |
| --- | --- | --- |
| Stack size | 1-8 MB | 2 KB (initial) |
| Creation | 1000+ ns | ~200 ns |
| Context switch | 1000+ ns | ~200 ns |
| Management | OS kernel | Go runtime |
| Scalability | 100-1000 threads | 100,000+ goroutines |


Goroutine Parking

The term "parking" in the context of the scheduler means that a running goroutine (G) is transitioned into a waiting state, and the OS thread (M) on which it was running is freed up to execute other goroutines. This is a key mechanism that prevents OS threads from idling when a goroutine blocks (e.g., waiting for data from a channel or for a system call to complete).

The process of "parking" and "unparking" is inextricably linked to the concepts of handoff and work stealing.

Scenarios for Goroutine Parking and M Behavior

Let's examine the main scenarios that clearly illustrate how M and G interact.

  1. Blocking System Call (e.g., reading from a file). This is the classic case where a goroutine enters a long blocking call to the OS kernel.
  • Situation: Goroutine G1 is running on thread M1, which is attached to processor P1. G1 initiates a blocking system call (e.g., read(fileFd)).
  • Scheduler Action: The Go scheduler understands that M1 will be blocked by the kernel for an indeterminate amount of time. It detaches (handoff) P1 from M1. M1 goes into waiting for the system call to complete, along with G1.
  • Result: The now-freed P1 no longer has a thread. The scheduler takes another thread, for example, M2 (from a sleeping pool or creates a new one), and attaches it to P1. M2 begins executing the next goroutine G2 from P1's local run queue.
  • Return: When the system call in G1 completes, M1 wakes up. What should G1 do? It needs to run again. The scheduler looks for a free P. If there is a free P (e.g., P2), M1 with G1 attaches to it and continues working. If not, G1 is placed into the global run queue, and M1 becomes a spare (sleeping) thread.
  2. Blocking on a Channel or Mutex. Here, the blocking occurs not at the kernel level, but at the runtime level.
  • Situation: Goroutine G1 on M1 (with P1) attempts to read from a channel that has no data.
  • Action: The scheduler transitions G1 into a waiting state and places it into the wait queue of that channel. Important: M1 is no longer needed for this goroutine.
  • Result: M1 remains attached to P1 and immediately (in the same scheduler cycle) takes the next runnable goroutine G2 from P1's local queue and starts executing it. The OS thread does not idle for a nanosecond.
  • Return: When another goroutine sends data into the channel, G1 becomes "runnable". It will be added to the local queue of the P on which the sender goroutine is running (to improve data locality) or to the global run queue.

How the Scheduler Balances Load: Work Stealing

What happens when a thread has nothing to do? Imagine: on M1, attached to P1, all goroutines in the local queue are finished, and the global queue is empty. The scheduler will not let M1 simply go to sleep while others have work.

The work stealing mechanism is activated:

  1. M1, in a "spinning" state (actively seeking work), selects another processor, for example, P2, and attempts to steal approximately half of the goroutines from its local run queue.

  2. If successful, M1 begins executing the stolen goroutines. This way, the load is evenly distributed across all available OS threads.

The "Spinning Threads" Mechanism

When a thread (M) looks for work, as in the example above, it enters a spinning state. Spinning threads are not parked immediately; instead, they actively search for work for a short time (stealing, checking the global queue, the netpoller), consuming CPU resources in the process.

Why is this necessary? To avoid the frequent and expensive creation/destruction or parking/unparking of OS threads. The algorithm works like this:

  1. When new work appears (a runnable goroutine), the scheduler checks if there are any idle P's and if there are already any spinning threads.

  2. If there is an idle P and no spinning threads exist, the scheduler wakes up (unparks) a new OS thread (M) or creates one. This new thread enters spinning mode.

  3. If spinning threads already exist, a new thread is not woken up — the existing ones will handle the work themselves.

This mechanism keeps exactly as many active threads as needed for maximum core utilization, without unnecessary context-switching overhead.

Locality and Affinity

The Go runtime tries, as much as possible, to keep a goroutine attached to the same thread. Why? This improves CPU cache locality. If a goroutine constantly jumps from thread to thread, its data must be constantly invalidated and reloaded into the cache of the new core, which is slow.

Thanks to the local queues on P and the work stealing mechanism, a goroutine that was blocked on a channel is highly likely to be resumed on the same P (and consequently, the same M) where it was previously running. The scheduler even gives priority to goroutines just unparked from channels, placing them at the front of the queue (runnext) so they execute as soon as possible and resume work with a "warm" cache.

Schematic Lifecycle

  1. G is Running: G executes on M (which is attached to P).

  2. G Blocks: G blocks (syscall, channel, mutex). The scheduler transitions G to Waiting.

    • If syscall: M detaches from P, P looks for a new M. G remains on M.
    • If channel/mutex: G enters the resource's wait queue. M stays with P and takes a new G from the local queue.
  3. G becomes Runnable: The resource G was waiting for becomes available. The scheduler adds G to a queue (most often the local queue of the P that unblocked it).

  4. G is Selected for Execution: When its turn comes, some M attached to this P will remove G from the queue and begin executing it.
