
Aceld


Understanding the Golang Goroutine Scheduler GPM Model

#go

Author:
discord: https://discord.gg/xQ8Xxfyfcz
zinx: https://github.com/aceld/zinx
github: https://github.com/aceld
aceld's home: https://yuque.com/aceld
Twitter: https://twitter.com/Aceld_


Understanding the principles of the Golang Goroutine scheduler is fundamental knowledge that Golang developers strive to grasp. Go's language features are backed by a well-designed scheduling mechanism and a high-performance Goroutine scheduling model, so mastering and deeply understanding this material is essential for Go developers. This article introduces the origins of the Go scheduler and how it evolved into the GPM model. It explains in detail what happens when a Goroutine is launched and how the GPM model is initialized, covers visual tracing and debugging of the GPM model, and concludes by walking through the various trigger conditions and runtime scenarios of GPM scheduling.

1.1 The Origin of the Golang "Scheduler"

As widely recognized, all software operates on top of an operating system, and the true computational engine behind these software applications is the CPU. In the early days of operating systems, each program was treated as a separate process. Only when one program completed its execution could the next process begin, marking the era of single processing. During this time, all programs could only execute sequentially, as illustrated in Figure 1.1.

Figure 1.1

1.1.1 The Era of Single Processing Did Not Require a Scheduler

In the early days of single-process operating systems, two major issues were encountered:

(1) Singular execution flow: Computers could only handle one task at a time, and virtually all programs operated in a blocking manner. Asynchronous interactions, such as graphical interfaces or mouse input, were practically non-existent.

(2) Wasted CPU time due to process blocking: Over the complete lifecycle of a process, various physical components are accessed, including the CPU, cache, main memory, disk, and network, each with significantly different processing speeds. When these components are chained sequentially inside a single process, the fast ones spend much of their time waiting on the slow ones. For instance, when a program loaded data from disk, the CPU sat idle for the duration of the read/write. In a single-process operating system this was a clear waste of CPU capacity; ideally, the CPU would have been handed to another process for useful computation during that time.

So, could there be multiple processes running concurrently to collectively execute multiple tasks at a macroscopic level? Subsequently, operating systems gained their initial concurrency capability: multi-process concurrency. When one process became blocked, the system switched to another waiting process for execution, thereby optimizing CPU utilization and preventing CPU wastage.

1.1.2 Scheduler Requirements in the Era of Multi-Process/Multi-Thread

The advent of multi-process/multi-threaded operating systems addressed the blocking issue. When one process blocked, the CPU could immediately switch to executing other processes. Moreover, the CPU scheduling algorithms ensured that all running processes received a fair share of CPU time slices. From a macroscopic perspective, it appeared as though multiple processes were running concurrently, as illustrated in Figure 1.2.

Figure 1.2

Figure 1.2 depicts the CPU dividing its timeline among processes through a scheduler. To create the macroscopic perception that every process/thread is executing concurrently, the CPU must keep switching, allocating a time slice to each process.

A time slice, also referred to as a "quantum" or "processor slice," is a microscopically allocated segment of CPU time that a time-sharing operating system assigns to each running process (or, in preemptive kernels, the time from when a process begins running until it gets preempted). Modern operating systems like Windows, Linux, and macOS allow multiple processes to run at once. For example, you can listen to music in a media player while browsing the internet and downloading files. In reality, even though a computer may have multiple CPUs, a single CPU cannot truly run multiple tasks at the same instant. On a single CPU, these processes "appear to" run concurrently, but they actually execute in an interleaved fashion. This interleaving is made possible by short time slices (typically ranging from 5ms to 800ms on Linux), which are too short for users to perceive.

The operating system's scheduler allocates time slices to each process. Initially, the kernel assigns an equal time slice to each process. Subsequently, processes take turns executing during their allotted time slices. When all processes have exhausted their time slices, the kernel recalculates and assigns time slices to each process, and this process repeats.


However, a new problem emerged: processes consumed significant resources. The creation, switching, and termination of processes took a considerable amount of time. While CPU utilization improved, if there were too many processes, a substantial portion of the CPU's capacity was devoted to process scheduling and context switching, as illustrated in Figure 1.3.

Figure 1.3

For the Linux operating system, the CPU treats processes and threads in the same manner. As depicted in Figure 1.3, when the system has a limited number of CPUs and a relatively large number of processes/threads, the frequency of context switching increases, and the associated switching costs become more significant. This portion of performance consumption essentially doesn't contribute to useful computational power within the program. Consequently, even though multithreading may seem appealing, in reality, multithreaded development and design can become more complex. Developers need to address numerous issues related to synchronization and competition, such as locks, resource contention, and synchronization conflicts.

1.1.3 Coroutines Improve CPU Utilization

So, how can CPU utilization be improved? Multi-processes and multi-threads have indeed enhanced system concurrency. However, in today's high-concurrency internet scenarios, creating a separate thread for each task is not practical. Doing so would result in an overwhelming number of threads running simultaneously, leading to not only high context switching overhead but also substantial memory consumption (process virtual memory can consume up to 4GB on 32-bit operating systems, and threads typically require around 4MB each). This proliferation of processes or threads introduced new problems:

(1) High memory consumption.
(2) Significant CPU overhead in scheduling.

Engineers realized that a thread could be split into two forms: "kernel-mode" threads and "user-mode" threads. A user-mode thread essentially mimics a kernel-mode thread in user space, with the goal of being more lightweight (lower memory consumption, less isolation, faster switching) and more controllable (the program can manage the scheduler itself). Everything in user space is visible to the kernel, but from the kernel's perspective a user-mode thread is nothing more than a block of memory data.

A user-mode thread must be bound to a kernel-mode thread, but the CPU is unaware of the existence of user-mode threads; it only knows that it is running a kernel-mode thread (Linux's PCB - Process Control Block), as illustrated in Figure 1.4.

Figure 1.4

If we further classify threads, kernel threads remain as "threads," while user threads are referred to as "coroutines" (Co-routine). Threads at the operating system level are what we commonly know as kernel-mode threads, while user-mode threads come in various forms. As long as they can execute multiple tasks on the same kernel-mode thread, they are considered user-mode threads. Examples include coroutines, Golang's Goroutines, C#'s Tasks, and more.

Since one coroutine can be bound to one thread, is it possible for multiple coroutines to be bound to one or more threads? There are three mapping relationships between coroutines and threads: N:1, 1:1, and M:N.

1. "N:1" Relationship

N coroutines are bound to 1 thread. The advantage of this approach is that coroutines perform context switches within user-mode threads without transitioning to kernel mode, making these switches extremely lightweight and fast. However, the drawback is evident: all coroutines of one process are bound to a single thread, as illustrated in Figure 1.5.

Figure 1.5

Challenges Faced in the N:1 Relationship:
(1) Inability to harness the hardware's multi-core acceleration capabilities for a particular program.
(2) When one coroutine is blocked, it leads to thread blocking, causing all other coroutines within the same process to become non-executable, thus resulting in a lack of concurrency.

2. "1:1" Relationship

A 1 coroutine to 1 thread relationship is the simplest to implement: coroutine scheduling is handled entirely by the OS kernel's thread scheduler. While it avoids the drawbacks of the N:1 model, the cost of creating, deleting, and switching a coroutine is the full cost of the corresponding thread operation, which makes it comparatively expensive. The 1:1 relationship between coroutines and threads is depicted in Figure 1.6.

Figure 1.6

3. "M:N" Relationship

The M coroutines to N threads relationship combines aspects of both the N:1 and 1:1 models and overcomes the drawbacks of both, but it is the most complex to implement, as illustrated in Figure 1.7. Multiple coroutines are hosted on a scheduler whose downstream resources are multiple kernel threads running on multiple CPU cores. It's important to note that coroutines and threads differ in nature: thread scheduling by the CPU is preemptive, while coroutine scheduling in user mode is cooperative; one coroutine yields the CPU before the next one executes. Consequently, the design of the intermediate scheduler in the M:N model becomes crucial, and improving the efficiency with which coroutines are bound to and executed on threads becomes the primary goal of a language's scheduler design.

Figure 1.7

1.1.4 Go Language's Goroutines

In order to provide a more user-friendly approach to concurrency, Go uses Goroutines and Channels. Goroutines come from the concept of coroutines, letting a set of reusable functions run on top of a set of threads. Even if one Goroutine blocks, the runtime scheduler can still move the other Goroutines of that thread to other runnable threads. Most importantly, programmers are shielded from these low-level details, which simplifies programming and makes concurrency easier.

In Go, these coroutines are called Goroutines, and they are incredibly lightweight, occupying only a few kilobytes each. This small memory footprint allows Go to support a large number of Goroutines within limited memory space, facilitating greater concurrency. While a Goroutine's stack may only consume a few kilobytes, it is dynamically scalable, with the runtime automatically allocating more memory when needed.

Key characteristics of Goroutines include minimal memory usage (just a few KB) and flexible scheduling (handled by the runtime).
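To make the "lightweight" claim concrete, the sketch below launches a large number of Goroutines and waits for them to finish. The count of 100,000 is an arbitrary illustration (not a figure from this article), but on most machines it runs comfortably, whereas creating that many OS threads would not.

package main

import (
    "fmt"
    "sync"
)

func main() {
    const n = 100000 // arbitrary large number of Goroutines, for illustration only
    var wg sync.WaitGroup
    wg.Add(n)
    for i := 0; i < n; i++ {
        go func() {
            // Each Goroutine starts with a small stack (a few KB) that the
            // runtime grows on demand, which is why this scales so far.
            defer wg.Done()
        }()
    }
    wg.Wait()
    fmt.Println("all Goroutines finished")
}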

1.1.5 Deprecated Goroutine Scheduler

Now that we understand the relationship between coroutines and threads, the most critical aspect is the implementation of the scheduler responsible for scheduling coroutines. The scheduler currently used in Go was redesigned in 2012 because the previous scheduler had performance issues, leading to its abandonment after just four years of use. So, let's first analyze how the deprecated scheduler worked. Typically, Goroutines are denoted by the symbol G, and threads are denoted by M, as illustrated in Figure 1.8. The following content regarding the scheduler will use the symbols shown in Figure 1.8 for consistency.

Figure 1.8

Let's take a closer look at how the deprecated Golang scheduler was implemented. As shown in Figure 1.9, the early scheduler was built on the M:N model. Figure 1.9 gives a high-level overview: all coroutines, denoted as G, were placed in a single global Goroutine queue. Because this global queue was a resource shared by multiple M's, it was protected by a lock for synchronization and mutual exclusion.

Figure 1.9

In order for an M to execute or return a G, it must access the global G queue. Since there are multiple M's, which implies multiple threads accessing the same resource, locking is required to ensure mutual exclusion and synchronization. Therefore, the global G queue is protected by a mutex lock.

Several shortcomings of the old scheduler can be identified:
(1) Creating, destroying, and scheduling a G all required each M to acquire the lock, leading to intense lock contention.
(2) Transferring G's between M's caused delays and additional system overhead. For example, when the G running on M creates a new coroutine G', M must hand G' over to another thread M2 (if one is available) in order to keep executing G. This also results in poor locality: G' and G are related, so it would be preferable to run them on the same M rather than on a different M2, as depicted in Figure 1.10.

Figure 1.10

(3) System calls (CPU switches between M's) resulted in frequent thread blocking and unblocking operations, leading to increased system overhead.

1.2 Design Principles of the Goroutine Scheduler GPM Model

In response to the issues with the previous scheduler, Go introduced a new scheduler design. In the new scheduler, in addition to M (threads) and G (coroutines), a new component called P (processor) was introduced, as illustrated in Figure 1.11.

Figure 1.11

1.2.1 The GPM Model

In Go, threads serve as the entities that execute Goroutines, and the scheduler's role is to allocate runnable Goroutines to worker threads.

Figure 1.12

In the GPM model, there are several important concepts:

(1) Global Queue: It stores waiting-to-run Goroutines (G). The global queue is a shared resource accessible by any P, making it a global resource within the model. Consequently, mutual exclusion mechanisms are used for read and write operations on this queue.

(2) P's Local Queue: Similar to the global queue, it contains waiting-to-run Goroutines, but with a limited capacity, typically not exceeding 256 Goroutines. When a new Goroutine G' is created, it is initially added to the P's local queue. If the local queue becomes full, half of its Goroutines are moved to the global queue.

(3) P List: All P's are created at program startup and are stored in an array, with the maximum number of P's typically set to GOMAXPROCS (a configurable value).

(4) M: Threads (M) request tasks by obtaining a P. They retrieve G's from the P's local queue. When the local queue is empty, an M may attempt to fetch a batch of G's from the global queue and place them in its P's local queue or steal half from another P's local queue. The M then executes a G, and after completion, it fetches the next G from its P, repeating this process.

The Goroutine scheduler works in conjunction with the OS scheduler through the M, where each M represents a kernel thread. The OS scheduler allocates kernel threads to CPU cores for execution.
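The work-finding order described in (4), local queue first, then the global queue, then stealing, can be sketched as follows. This is only a simplified toy model to illustrate the ordering; it is not the runtime's actual data structures or code.

package main

import "fmt"

// Toy model of the order in which an M looks for work, as described in (4):
// 1. the P's local queue, 2. the global queue, 3. stealing half of another
// P's local queue. Illustrative only; not the real runtime implementation.
func findWork(local, global, victim *[]int) (int, bool) {
    // 1. The current P's local queue.
    if len(*local) > 0 {
        g := (*local)[0]
        *local = (*local)[1:]
        return g, true
    }
    // 2. The global queue (in the runtime this requires taking a lock).
    if len(*global) > 0 {
        g := (*global)[0]
        *global = (*global)[1:]
        return g, true
    }
    // 3. Steal half of another P's local queue (from its tail, see 1.3.8).
    if n := len(*victim); n > 0 {
        half := (n + 1) / 2
        *local = append(*local, (*victim)[n-half:]...)
        *victim = (*victim)[:n-half]
        g := (*local)[0]
        *local = (*local)[1:]
        return g, true
    }
    return 0, false
}

func main() {
    local, global, victim := []int{}, []int{1}, []int{2, 3, 4, 5}
    for {
        g, ok := findWork(&local, &global, &victim)
        if !ok {
            break
        }
        fmt.Println("running G", g)
    }
}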

1. Regarding the Number of P's and M's:

(1) The number of P's is set by the environment variable $GOMAXPROCS or the runtime function GOMAXPROCS(). This means that at any given moment during program execution there are at most $GOMAXPROCS Goroutines executing simultaneously.
(2) The number of M's is constrained by Go itself. When a Go program starts, it sets a maximum number of M's, which defaults to 10,000. However, the kernel can rarely support that many threads, so this limit can usually be ignored. The runtime/debug package provides SetMaxThreads() to adjust the maximum number of M's, and when an M becomes blocked, new M's can be created.
There is no strict relationship between the number of M's and the number of P's. When an M blocks, a P may create a new M or switch to another available M, so even if the default number of P's were 1, multiple M's could still be created when needed. A short sketch of reading and adjusting these two knobs follows below.
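The snippet below is a minimal illustration of those two settings: it reads the current GOMAXPROCS value, changes it, and raises the M limit via runtime/debug. The specific values (4 and 20000) are arbitrary examples, not recommendations from this article.

package main

import (
    "fmt"
    "runtime"
    "runtime/debug"
)

func main() {
    // Calling GOMAXPROCS with 0 reports the current value without changing it.
    fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))

    // Set the number of P's explicitly (4 is just an example value).
    runtime.GOMAXPROCS(4)

    // Raise the limit on OS threads (M's); the default is 10000.
    // SetMaxThreads returns the previous limit.
    prev := debug.SetMaxThreads(20000)
    fmt.Println("previous max threads =", prev)
}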

2. Regarding the Creation Timing of P's and M's:

(1) P's are created at program startup: once the maximum number of P's, n, is determined, the runtime creates all n P's.
(2) M's are created when there aren't enough M's available to associate with P's and execute the runnable Goroutines within them. For example, if all M's are blocked, and there are still many ready tasks in the P, the system will search for idle M's, and if none are available, it will create new M's.

1.2.2 Scheduler Design Strategies

Strategy One: Thread Reuse

The aim is to avoid frequent thread creation and destruction and instead focus on reusing threads.

1) Work Stealing Mechanism

When the current thread has no runnable Goroutine to execute, it attempts to steal Goroutines from the local queues of P's bound to other threads, rather than destroying the thread. This process is illustrated in Figure 1.13.

Figure 1.13

It's important to note that the act of stealing Goroutines is initiated by a P, not an M, because the number of P's is fixed. If an M cannot acquire a P, it means that the M doesn't have a local queue, let alone the capability to steal from other P's queues.

2) Hand Off Mechanism

When the current thread blocks in a system call while executing a Goroutine, the thread releases its P and hands the P off to another available thread. This situation is depicted in Figure 1.14. If, in the GPM combination around M1, the currently scheduled G1 blocks, the handoff mechanism kicks in: to make the most of M and P, a single blocked G1 must not hold up the remaining work, so the P is released immediately.

Figure 1.14

As shown in Figure 1.15, to release the P, the P is detached from M1 and G1. Since M1 is the thread actually executing G1, G1's stack lives on M1, so M1 must enter the blocked state together with G1. The released P, however, needs to be attached to another M; here it selects M3 (or creates a new M, or wakes a sleeping one, if no M3 is available at the time) and establishes a new binding. The P, now bound to M3, carries on working: accepting new G's or stealing from other queues.

Figure 1.15

Strategy Two: Leveraging Parallelism

GOMAXPROCS determines the number of P's, so at most GOMAXPROCS threads are running Go code across multiple CPUs at the same time. GOMAXPROCS therefore also limits the degree of parallelism; for example, with GOMAXPROCS = number of cores / 2, at most half of the CPU cores are used in parallel.
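As a minimal sketch of that example, the following snippet caps parallelism at half of the logical CPUs; the exact policy is up to the application.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Use at most half of the logical CPUs, matching the
    // GOMAXPROCS = number of cores / 2 example above.
    n := runtime.NumCPU() / 2
    if n < 1 {
        n = 1
    }
    runtime.GOMAXPROCS(n)
    fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d\n", runtime.NumCPU(), runtime.GOMAXPROCS(0))
}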

Strategy Three: Preemption

With ordinary coroutines, one must wait for a coroutine to voluntarily yield the CPU before the next one can run. In Go, a Goroutine can occupy the CPU for at most about 10ms before the scheduler may preempt it, which prevents other Goroutines from being starved of CPU time. This is a distinguishing feature of Goroutines compared to regular coroutines.
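A small demonstration of this guarantee (it relies on the signal-based preemption available since Go 1.14): even with a single P, a Goroutine stuck in a tight CPU loop cannot prevent the main Goroutine from running.

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // a single P, so both Goroutines compete for one thread

    go func() {
        for {
            // A busy loop with no function calls or channel operations.
            // The runtime still preempts it periodically (roughly every 10ms),
            // so the main Goroutine below is not starved.
        }
    }()

    time.Sleep(100 * time.Millisecond)
    fmt.Println("main Goroutine was scheduled despite the busy loop")
}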

Figure 1.16

Strategy Four: Global G Queue

In the new scheduler, the global G queue still exists but with weakened functionality. When an M attempts to steal Goroutines but cannot find any in the queues of other P's, it can retrieve Goroutines from the global G queue, as illustrated in Figure 1.17.

Figure 1.17

1.2.3 go func() Scheduling Process

Now, let's deduce what happens in the GPM model when executing the line of code go func().
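As a concrete reference point, a minimal program containing such a statement might look like the following (the sync.WaitGroup is only there so main() does not exit before the new Goroutine gets to run):

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(1)

    // This single statement is what steps (1) through (6) below walk through:
    // a new G is created and handed to the scheduler.
    go func() {
        defer wg.Done()
        fmt.Println("hello from a new Goroutine")
    }()

    wg.Wait()
}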

(1) Creating a Goroutine using go func(), as depicted in Figure 1.18.

Figure 1.18

(2) There are two queues for storing G: one is the local queue of the scheduler P, and the other is the global G queue. The newly created G is initially stored in the P's local queue. If the local queue of P is already full, it will be stored in the global queue, as illustrated in Figure 1.19.

Figure 1.19

(3) G can only run within M, and each M must hold a P, establishing a 1:1 relationship between M and P. M pops an executable G from the local queue of P for execution. If P's local queue is empty, M retrieves a G from the global queue. If the global queue is also empty, M attempts to steal an executable G from other MP combinations, as depicted in Figure 1.20.

Figure 1.20

(4) The process of scheduling G for execution in an M follows a looping mechanism, as shown in Figure 1.21.

Figure 1.21

(5) When an M is executing a G, if a blocking system call or another blocking operation occurs, that M blocks. If there are still G's waiting to run, the runtime detaches the blocked M from its P and hands the P to a new OS thread (reusing an idle thread if one is available) so the P can keep serving its queue, as illustrated in Figure 1.22.

Figure 1.22

(6) When the M completes the system call, the G will attempt to acquire an idle P for execution and be placed in the local queue of that P. If acquiring a P is not successful, the thread M enters a sleep state and joins the pool of idle threads. Meanwhile, the G is placed in the global queue, as depicted in Figure 1.23.

Figure 1.23

1.2.4 Scheduler Lifecycle

In the GPM model of the Golang scheduler, there are two distinct roles: M0 and G0.

M0

  • The main thread with the index 0 started after the program launches.
  • Located in the global variable runtime.m0, requiring no allocation on the heap.
  • Responsible for executing initialization operations and launching the first G.
  • After launching the first G, M0 functions like any other M.

G0

  • Each time an M is started, the first Goroutine created is G0.
  • G0 is solely dedicated to scheduling tasks.
  • G0 does not point to any executable function.
  • Every M has its own G0.
  • During scheduling or system calls, M switches to G0, and scheduling is performed through G0.
  • M0's G0 is placed in global space.

The overall lifecycle of the scheduler, including the roles of M0 and G0, is illustrated in Figure 1.24.

Figure 1.24

Let's trace the following code and analyze the structures inside the scheduler. The code is as follows:

package main

import "fmt"

func main() {
    fmt.Println("Hello world")
}

The overall analysis process is as follows:

(1) The runtime creates the initial thread M0 and Goroutine G0, and associates the two.
(2) Scheduler initialization: initialize M0, stacks, and garbage collection, and create and initialize the P list consisting of GOMAXPROCS P's, as shown in Figure 1.25.

Figure 1.25

(3) The main() function in the example code is main.main, and there is also a main() function in the runtime, runtime.main. After the code is compiled, runtime.main will call main.main. When the program starts, it will create a Goroutine for runtime.main, known as the Main Goroutine, and then add the Main Goroutine to the local queue of P, as shown in Figure 1.26.

Figure 1.26

(4) M0 starts. M0 is already bound to a P, so it fetches a G from that P's local queue and obtains the Main Goroutine.

(5) Each G has its own stack. M sets up the running environment according to the stack and scheduling information stored in G.

(6) M runs G.

(7) When a G exits, M goes back to fetch the next runnable G, and this repeats until main.main exits. At that point runtime.main performs deferred-call and panic processing and then calls runtime.exit to terminate the program.

The lifecycle of the scheduler spans almost the entire life of a Go program. All the preparation done before the runtime.main Goroutine runs is done for the scheduler; the running of the runtime.main Goroutine is the real start of the scheduler, and the scheduler's work ends when runtime.main returns.

1.2.5 Visualizing GPM Programming

After grasping the fundamental model of GPM and the initialization lifecycle, can we visualize the data of a program's GPM model during its execution using some tools? Go provides two ways to inspect a program's GPM data.

1. go tool trace

The trace tool records runtime information and provides a visual web page. Below is a simple example. The main() function creates a trace, which runs in a separate Goroutine. Then main() prints "Hello World" and exits. The code is as follows:

// File: go_tool_trace.go
package main

import (
    "fmt"
    "os"
    "runtime/trace"
)

func main() {
    // Create a trace file
    f, err := os.Create("trace.out")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Start trace Goroutine
    err = trace.Start(f)
    if err != nil {
        panic(err)
    }
    defer trace.Stop()

    // main
    fmt.Println("Hello World")
}

Run the program:

$ go run go_tool_trace.go
Hello World

A trace.out file is generated in the current directory. Open it with the go tool:

$ go tool trace trace.out
2021/02/23 10:44:11 Parsing trace...
2021/02/23 10:44:11 Splitting trace...
2021/02/23 10:44:11 Opening browser. Trace viewer is listening on http://127.0.0.1:33479

Open the browser at http://127.0.0.1:33479, click "view trace" to see the visualized scheduling flow, as shown in Figure 1.27.

Figure 1.27

Figure 1.27 shows the complete flow of the Goroutines involved in the execution of this code, along with their respective timelines.

(1) G Information. Clicking on the row labeled "Goroutines" in the visualized data bar reveals detailed information, as depicted in Figure 1.29.

Figure 1.29

There are a total of two Gs in the program. One is the special G0, an essential initialized G for each M. The other is G1, the Main Goroutine responsible for executing the main() function, which remains in the runnable and running states for a period of time.

(2) M Information: Clicking on the "Threads" row in the visualized data bar reveals detailed information, as shown in Figure 1.30.

Figure 1.30

(3) P Information: Now let's look at the PROCS section on the left. The listed items, such as Proc 0 and Proc 1, represent the processors (P's), as shown in Figure 1.31.

Figure 1.31

G1, which invokes main.main, spawns the trace goroutine G18. G1 is running on P1, while G18 is executing on P0. In this scenario, there are two P's, and each P must be associated with an M for scheduling G's. Next is the information about M, as illustrated in Figure 1.32.

Figure 1.32

In the figure, you can observe that when G18 is running on P0, there is additional data for M in the "Threads" row. Clicking on it reveals detailed information, as shown in Figure 1.33.

Figure 1.33

The newly added M2 is dynamically created by P0 to execute G18.

2. debug trace

// File: trace2.go
package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 5; i++ {
        time.Sleep(time.Second)
        fmt.Println("Hello World")
    }
}

Compile the above code:

$ go build trace2.go

Next, run it with debug trace:

$ GODEBUG=schedtrace=1000 ./trace2 
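With schedtrace=1000 the runtime prints one summary line per second. The exact values vary by machine and by run; an illustrative line (not captured from the run above) looks roughly like this:

SCHED 0ms: gomaxprocs=2 idleprocs=0 threads=4 spinningthreads=1 idlethreads=1 runqueue=0 [0 0]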

The output analysis is as follows:

(1) SCHED: Debug information output flag, indicating that this line is from the Goroutine scheduler.
(2) 0ms: The time from the program start to the output of this log.
(3) gomaxprocs: Number of P's; in this example there are 2 P's because GOMAXPROCS defaults to the number of CPU cores. It can be changed with GOMAXPROCS.
(4) idleprocs: Number of P in idle state; the difference between gomaxprocs and idleprocs tells us the number of actively executing P in Go code.
(5) threads: Number of OS threads/M, including the number of M used by the scheduler, plus runtime's own threads like sysmon.
(6) spinningthreads: Number of OS threads in a spinning state.
(7) idlethreads: Number of OS threads in an idle state.
(8) runqueue=0: Number of G in the Scheduler's global queue.
(9) [0 0]: Number of G in the local queues of the 2 P.

Viewing GPM data can be done through both go tool trace and Debug trace. Both provide insights into the distribution of GPM during the program's execution. go tool trace additionally offers a local web-based visualization, while Debug trace outputs information to the terminal.

1.3 Go Scheduler: Comprehensive Analysis of Scheduling Scenarios

This section will delve into a comprehensive analysis of scheduling variations in Golang's GPM model across different scenarios. These scenarios cover almost all situations encountered during the scheduling process of Goroutines. Analyzing these scheduling scenarios helps us better appreciate the charm of the Golang scheduler.

1.3.1 Scenario 1: G1 Creates G2

This scenario primarily demonstrates the locality of GPM scheduling. P owns G1, and M1 starts running G1 after acquiring P, as illustrated in Figure 1.34.

Figure 1.34

When G1 uses go func() to create G2, for the sake of locality, G2 is given priority and is added to the local queue of P1, as depicted in Figure 1.35.

Figure 1.35

Scenario 1 mainly illustrates the locality of the GPM scheduler. By default, when G1 creates a new G2, G2 is placed in the local queue of P1 first. Because G1 and G2 share related memory and stack information, keeping them on the same M1 and P makes the cost of switching between them relatively small; this is exactly the property that locality aims to preserve.

1.3.2 Scenario 2: G1 Execution Completed

Currently, M1 is bound to G1, and the local queue of P1 contains G2. Meanwhile, M2 is bound to G4, and the local queue of P2 contains G5, as illustrated in Figure 1.36.

Figure 1.36

After the completion of the execution of G1, the Goroutine running on M1 is switched to G0, as shown in Figure 1.37.

Figure 1.37

G0 is responsible for coordinating the context switch of Goroutines during scheduling. It retrieves G2 from the local queue of P, switches from G0 to G2, and starts executing G2, achieving the reuse of thread M1, as illustrated in Figure 1.38.

Figure 1.38

Scenario 2 primarily demonstrates the reusability of M in the GPM scheduling model. When a Goroutine (G1) has completed its execution, regardless of the availability of other Ps, M will prioritize fetching a pending G from its bound P's local queue. In this case, the pending G is G2.

1.3.3 Scenario 3: G2 Creates Excessive Gs

Suppose each P's local queue can only store 4 Gs, and the local queue of P1 is currently empty. If G2 needs to create 6 Gs, then G2 can only create the first 4 Gs (G3, G4, G5, G6) and place them in its current queue, as illustrated in Figure 1.39. The excess Gs (G7, G8) will not be added to the local queue of P1. The detailed explanation of where the excess Gs are placed will be covered in Scenario 4.

Figure 1.39

Scenario 3 illustrates that if a G creates an excessive number of Gs, the local queue may become full. The excess Gs need to be arranged according to the logic explained in Scenario 4.

1.3.4 Scenario 4: G2 Locally Full, Creating G7

When G2 attempts to create G7 and discovers that P1's local queue is full, it triggers a load-balancing step. This step moves the first half of the G's in P1's local queue, together with the newly created G7, to the global queue. As illustrated in Figure 1.40, G3, G4, and the newly created G7 are placed in the global queue, while G5 and G6 move up to the front of P1's local queue.

Figure 1.40

When these Gs are moved to the global queue, their order becomes disrupted. Therefore, G3, G4, and G7 are added to the global queue in a non-sequential manner.

1.3.5 Scenario 5: G2 Local Queue Not Full, Creating G8

When G2 creates G8 and the local queue of P1 is not full, G8 will be added to the local queue of P1.

Figure 1.41

Newly created G8 is given top priority and is placed into the local queue first, adhering to the locality property. Since the local queue already contains other Gs at its head, G8 will sequentially enter from the queue's tail, as depicted in Figure 1.41. When the scheduling of G2 is completed, the subsequent scheduled G should be G5.

1.3.6 Scenario 6: Waking Up a Sleeping M

In the GPM model, when creating a new G, the running G will attempt to wake up other idle P and M combinations for execution. This implies that there might be surplus M instances from previous operations, and these M instances won't be immediately reclaimed by the operating system. Instead, they are placed in a sleeping thread queue. The condition triggering the awakening of an M from the sleeping queue is attempting to create a new G.

As illustrated in Figure 1.42, currently only P1 and M1 are bound, and P1 is executing G2. When G2 attempts to create a new G, this triggers an attempt to wake an M from the sleeping thread queue and bind it to another idle P (together with that P's local queue).

Figure 1.42

G2 is waking up M2 from the sleeping queue, as shown in Figure 1.43.

Figure 1.43

If M2 discovers available P resources, it will be activated and bound to P2, as illustrated in Figure 1.44.

Figure 1.44

At this point, having bound P2, M2 needs to find G's to execute. Each M has a G0 that is responsible for scheduling other G's, so when there is no ordinary G for M2 and P2 to run, M2 runs its G0. When a P's local queue is empty and its M is running G0, the M2-P2-G0 combination is referred to as a spinning thread, as shown in Figure 1.45.

Figure 1.45

One characteristic of a spinning thread is the continuous search for Gs. The process of finding Gs will lead to the flow of scenarios 7 and 8 to be introduced next.

1.3.7 Scenario 7: M2 Awoken Fetches a Batch of Gs from the Global Queue

M2 attempts to fetch a batch of Gs from the global queue (referred to as GQ) and places them in P2's local queue (referred to as LQ). The number of Gs that M2 fetches from the global queue follows the formula:

n = min( len(GQ)/GOMAXPROCS + 1, len(GQ), cap(LQ)/2 )

where GQ is the global queue and LQ is the P's local queue.

Referencing the source code as follows:

// /usr/local/go/src/runtime/proc.go

//…

// Steal from the global queue, must be called with the scheduler locked
func globrunqget(_p_ *p, max int32) *g {
    // Return nil if the global queue is empty
    if sched.runqsize == 0 {
        return nil
    }

    // Per-P portion, take all if there is only one P
    n := sched.runqsize/gomaxprocs + 1
    if n > sched.runqsize {
        n = sched.runqsize
    }

    // Cannot exceed the maximum number to take
    if max > 0 && n > max {
        n = max
    }

    // Do not take more than half the capacity of the local queue
    if n > int32(len(_p_.runq))/2 {
        n = int32(len(_p_.runq)) / 2
    }

    // Update the number of G's remaining in the global queue
    sched.runqsize -= n
    // Get the head G from the global queue
    gp := sched.runq.pop()
    // Decrement the count
    n--

    // Continue fetching the remaining n-1 Gs from the global queue and place them into the local queue
    for ; n > 0; n-- {
        gp1 := sched.runq.pop()
        runqput(_p_, gp1, false)
    }
    return gp
}
//…

At least 1 G is fetched from the global queue each time, but it avoids moving too many Gs from the global queue to the P local queue at once, leaving some for other Ps. This aims to achieve load balancing from the global queue to the P local queue.

Figure 1.46

If M2 is in a spinning thread state at this moment, the global queue has a count of 3, and the local queue capacity of P2 is 4, as shown in Figure 1.46. According to the load balancing formula, the number of Gs fetched from the global queue at once is 1. M2 will then fetch G3 from the head of the global queue and add it to P2's local queue, as illustrated in Figure 1.47.

Figure 1.47

Once G3 is added to the local queue, G0 schedules it: M2 switches from G0 to G3 and begins executing it. Meanwhile, G7 and G4 move up toward the head of the global queue in sequence, as shown in Figure 1.48.

Figure 1.48

Once G3 is scheduled, M2 is no longer a spinning thread.

1.3.8 Scenario 8: M2 Steals from M1

Assuming G2 has been running on M1 all along, after two rounds, M2 has successfully acquired G7 and G4 from the global queue and added them to the local queue of P2, completing their execution. Both the global queue and the local queue of P2 will be empty, as shown in Figure 1.49.

Figure 1.49

M2 will once again be in a state of scheduling G0. In this state, a spinning MP combination continuously looks for Gs that can be scheduled. Otherwise, waiting in idle would mean wasting thread resources. Since the global queue is empty, M2 will perform stealing: it steals half of the Gs from another P that has Gs and adds them to its own P's local queue. P2 takes half of the Gs from the tail of P1's local queue, as shown in Figure 1.50.

Figure 1.50

In this case, half means only one G8. Therefore, M2 will attempt to steal G8, placing it in P2's local queue, as illustrated in Figure 1.51.

Figure 1.51

The subsequent process is similar to the previous scenario. G0 will schedule G8, and as a result, M2 will no longer be in a spinning thread state, as shown in Figure 1.52.

Figure 1.52

1.3.9 Scenario 9: Maximum Limit of Spinning Threads

The G5 and G6 in M1's local queue have already been taken and executed by M2; currently, M1 and M2 are running G2 and G8 respectively, as shown in Figure 1.53. In the GPM model, the GOMAXPROCS variable determines the number of P's; unlike M's, P's are not created or destroyed on demand.

Figure 1.53

Assuming GOMAXPROCS=4, the number of P's is also 4. At this point, P3 and P4 are bound to M3 and M4, as illustrated in Figure 1.54.

Figure 1.54

M3 and M4 do not have any Goroutines to run, so they are currently bound to their respective G0, and both M3 and M4 are in a spinning state, continually searching for Goroutines.

Why allow M3 and M4 to spin? Spinning is still running, and a thread that runs without executing a Goroutine is, strictly speaking, wasting CPU. So why not destroy these threads to save CPU resources? Because creating and destroying threads also has a time cost, and the philosophy of the GPM scheduler is to have an M ready the instant a new Goroutine is created; destroying and recreating threads would add latency and reduce efficiency. Of course, the system also recognizes that too many spinning threads waste CPU, so at most GOMAXPROCS spinning threads are allowed (in the current example GOMAXPROCS=4, so there are 4 P's in total); any surplus idle threads are put to sleep. If there is an M5 running but no available P to bind to, M5 is added to the sleeping thread queue, as shown in Figure 1.55.

Figure 1.55

1.3.10 Scenario 10: Blocking System Call in G

Assuming that, in addition to the spinning threads M3 and M4, there are also idle threads M5 and M6 (not bound to any P; note that there can be at most 4 P's, so here the number of M's exceeds the number of P's). G8 creates G9, as illustrated in Figure 1.56.

Figure 1.56

If G8 makes a blocking system call at this point, M2 and P2 are immediately unbound. P2 performs the following checks: if P2's local queue has a G or the global queue has a G that needs execution, and there is an idle M available, P2 will promptly wake up one M to bind with (if there are no sleeping M's, a new one will be created for binding). Otherwise, P2 will be added to the list of idle P's, waiting for an M to acquire an available P.

In this scenario, P2's local queue has G9, and it can be bound to another idle thread, M5, as shown in Figure 1.57. During the blocking period of G8, M2's resources are temporarily occupied by G8.

Figure 1.57

The above is a scenario where a G is blocked due to a system call.

1.3.11 Scenario 11: G undergoes a non-blocking system call

Continuing from Scenario 10, where G8 created G9: assume the blocking system call made by G8 has now finished, so G8 is no longer blocked, as shown in Figure 1.58.

Figure 1.58

In Scenario 10, even though M2 and P2 become unbound, M2 remembers P2. Then, when G8 and M2 exit the system call, M2 attempts to reacquire the previously bound P2. If P2 is available, M2 will rebind with P2. If it cannot be obtained, M2 will handle it in other ways, as shown in Figure 1.59.

Figure 1.59

In the current example, at this moment, P2 is bound to M5, so M2 naturally cannot acquire P2. M2 will attempt to find an available P from the idle P queue, as shown in Figure 1.60.

Figure 1.60

If there are still no idle P's in the queue, M2 genuinely cannot find a P to bind to. G8 is then marked as runnable and added to the global queue, and M2, having no P, enters a sleeping state and joins the idle thread list (a thread that sleeps for a long time may eventually be reclaimed and destroyed), as shown in Figure 1.61.

Figure 1.61

1.4 Summary

The Go scheduler is lightweight and straightforward, capable of efficiently managing the scheduling of Goroutines and giving Go its native, powerful concurrency capabilities. The essence of Go's scheduler is to distribute a large number of Goroutines across a small number of OS threads and to use multi-core parallelism to achieve strong concurrency. Of course, the principles of the Go scheduler go well beyond the 11 scenarios described in this article, and the scheduling algorithm continues to evolve and improve. If you wish to dig deeper into the Go scheduler, studying the source code of the GPM model will require a more substantial investment of time and patience.


Author:
discord: https://discord.gg/xQ8Xxfyfcz
zinx: https://github.com/aceld/zinx
github: https://github.com/aceld
aceld's home: https://yuque.com/aceld
