What is the Go scheduler, and why do we even care?
Well, it is the behind-the-scenes orchestrator of any Go program. To explain, let's look at a demo program and walk through what the Go scheduler's jobs are and why we should care about it.
func main() {
    someTasks := getTasks()
    for _, task := range someTasks {
        // Process each task in a separate goroutine.
        go processTask(task) // creates goroutines
    }
    // Wait for results on the channel.
    for res := range resultCh { // blocks on channel receive
        log.Println(res)
    }
}

func processTask(t task) {
    // Fetch some metadata about the task.
    go fetchMetaData(t) // creates another goroutine
    usefulInfo := complexComputation(t)
    // Write the result to a file.
    file, err := os.OpenFile("result.txt", os.O_WRONLY|os.O_CREATE, 0644) // blocking system call
    // ...
}
We have the main goroutine. It creates some processing goroutines, and then it waits on a channel. Each processTask goroutine creates another goroutine to fetch some metadata, then does some long-running, complicated processing, and then writes to the file system.
Now, the scheduler is what makes it so that:
- the goroutines we create actually run, and run concurrently and independently,
- they are paused and resumed when they block on a channel or mutex operation,
- network I/O and blocking system calls, like that file-system call, are coordinated,
- runtime tasks like garbage collection get to run too.
It does all of this for tens and even hundreds of thousands of goroutines, and its design, and its decisions about what to run when, have a huge impact on the performance of any program.
Why does Go need a scheduler?
So, Go uses goroutines, or what we call user-space threads. They are conceptually similar to kernel threads, but the big difference is that they are managed entirely by the Go runtime. Now, why would Go choose to use these user-space threads? Because they can be implemented to be lighter weight and faster. A goroutine, for example, has a much smaller memory footprint than a kernel thread, and goroutines are much faster to create, destroy, and context-switch between. So that's nice; we want these lightweight user-space threads. There's only one problem: the operating system only knows how to schedule kernel threads onto your CPU cores. So how do these goroutines run? This is where the Go scheduler comes in: it puts goroutines on kernel threads that run on the CPU.
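You can get a feel for just how cheap goroutines are with a quick experiment. Here is a minimal sketch (exact numbers are timing- and machine-dependent): each goroutine starts with a stack of only a few kilobytes, so spawning 100,000 of them is routine, while 100,000 kernel threads would exhaust most systems.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100000; i++ {
		wg.Add(1)
		go func() { // each goroutine starts with a tiny (~2 KB) stack
			defer wg.Done()
		}()
	}
	fmt.Println("alive right now:", runtime.NumGoroutine()) // timing-dependent
	wg.Wait()
}
```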
What should be the basic properties of Go scheduler?
- Use a small number of kernel threads. Operating-system threads (kernel threads) are expensive, so we want to use a small number of them.
- Support high concurrency. Go is intended for highly performant systems, so applications should be able to create lots of goroutines, and our scheduler should be able to keep up.
- Leverage parallelism, i.e. scale to N cores. We live in the age of multicore: if your machine has N cores, it can run up to N things in parallel, because it can run up to N threads in parallel.
This is how scheduling works: when a goroutine performs one of these operations, under the hood the runtime switches your program into the scheduler. So it is at those events, goroutine creation and channel receives, that scheduling happens. What the scheduler does at that point varies. It might keep running the same goroutine on the thread; that is what happens on goroutine creation, which is why, in that picture, Gmain keeps running on the main thread. In the case of a channel receive, however, we want the goroutine to block, so Gmain is switched out for G1.
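In code, those two scheduling points look like this (a small, self-contained sketch of the events just described):

```go
package main

import "fmt"

func main() {
	ch := make(chan int)

	// Create event: the runtime switches into the scheduler, which
	// queues the new goroutine; main typically keeps running.
	go func() { ch <- 42 }()

	// Channel receive: the channel is empty at first, so the scheduler
	// blocks main and is free to run another goroutine on this thread.
	v := <-ch
	fmt.Println(v)
}
```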
So how do we go about multiplexing these goroutines onto kernel threads? We can tease this apart into two simpler questions: when do we create kernel threads, and how do we distribute goroutines across them?
Run Queue
Our program creates goroutines, and we need somewhere to track which of them are ready to run, so they can be scheduled onto the kernel threads. To do this, we use a heap-allocated struct called a runQueue, which is a FIFO queue: when there is a create event, the new goroutine is added to the tail of the queue, and when it's time to run a goroutine, we take one from the head.
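Here is a minimal sketch of such a runQueue, assuming a simplified goroutine descriptor type g (the real runtime uses an intrusive linked list, but the FIFO behavior is the same idea):

```go
// g is a stand-in for the runtime's goroutine descriptor.
type g struct {
	id int
	// ... stack, program counter, state, and so on
}

// runQueue is a FIFO queue of runnable goroutines.
type runQueue struct {
	items []*g
}

// push adds a newly created goroutine to the tail.
func (q *runQueue) push(gp *g) {
	q.items = append(q.items, gp)
}

// pop takes the next goroutine to run from the head.
func (q *runQueue) pop() *g {
	if len(q.items) == 0 {
		return nil
	}
	gp := q.items[0]
	q.items = q.items[1:]
	return gp
}
```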
Re-use threads
Well, let's see: we want threads, but we don't want too many thread creations and destructions. So what do we do? We could just reuse threads. The way the scheme works is that we create a thread when needed, that is, when there is a goroutine to run and all our threads are busy. But when a thread frees up, say because its goroutine exits, rather than destroying it, we keep it around for reuse. The way we do this is to park the thread: we put it to sleep so it frees up the CPU, and we track it so we can find it later when we need to wake it up. As for how to distribute goroutines across the threads we create, we have that runQueue, and all threads get their work from it.
So we have the main goroutine running on the main thread. Gmain creates G1, so we put G1 on the runQueue. Now we do our checks: there is work to do and all threads are busy, so we start T1. T1 comes along, pops G1 off the runQueue, and runs it. Now G1 exits. Rather than destroying T1, we park it and put it on the idle-threads list. Then Gmain creates G2, and we do the same thing: add it to the runQueue. But this time there is an idle thread, so we just reuse it, and T1 runs G2.
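Here is a toy model of this reuse scheme in Go itself, simulating "threads" with goroutines purely for illustration; the names and structure are invented for this sketch, not taken from the runtime:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type scheduler struct {
	mu     sync.Mutex
	cond   *sync.Cond
	runq   []string // names of runnable goroutines, FIFO
	parked int      // number of parked (idle) threads
	nextID int      // thread IDs, for logging
}

// submit models the create event: enqueue the goroutine, then wake a
// parked thread if there is one, or start a new thread if all are busy.
func (s *scheduler) submit(g string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.runq = append(s.runq, g)
	if s.parked > 0 {
		s.cond.Signal() // reuse a parked thread
		return
	}
	s.nextID++
	go s.thread(s.nextID) // all threads busy: create one
}

// thread runs goroutines from the queue; when the queue is empty it
// parks (sleeps) instead of exiting, so it can be reused later.
func (s *scheduler) thread(id int) {
	s.mu.Lock()
	for {
		for len(s.runq) == 0 {
			s.parked++
			s.cond.Wait() // park
			s.parked--
		}
		g := s.runq[0]
		s.runq = s.runq[1:]
		s.mu.Unlock()
		fmt.Printf("T%d running %s\n", id, g)
		s.mu.Lock()
	}
}

func main() {
	s := &scheduler{}
	s.cond = sync.NewCond(&s.mu)
	s.submit("G1")                    // starts T1
	time.Sleep(10 * time.Millisecond) // let T1 run G1 and park
	s.submit("G2")                    // T1 is woken and reused
	time.Sleep(10 * time.Millisecond)
}
```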
Unbounded number of threads - contention
So at this point we have a scheme: there's concurrency, it provides parallelism, and it nicely reduces thread creations. Does anybody see any problems with this scheme? Well, for one, we have many threads mutating a single data structure, so we need a lock, and locks are inherently bad for scalability: they effectively serialize goroutine scheduling, so we can only schedule one goroutine at a time. But there is a bigger problem. What happens if Gmain creates 10,000 long-running goroutines in quick succession? We're going to create 10,000 threads. In this scheme we can still have an unbounded number of threads accessing the runQueue, and an unbounded number of threads accessing a shared data structure with a lock results in contention, which means it is not scalable.
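The problematic pattern is easy to write. A runnable sketch (longRunningWork just stands in for any long-lived task; on real Go this burst is harmless, which is exactly what the better designs below achieve):

```go
package main

import "time"

// longRunningWork stands in for a long-lived, busy goroutine.
func longRunningWork(i int) {
	time.Sleep(time.Hour)
}

func main() {
	// A burst of 10,000 long-running goroutines: under the naive
	// scheme, each one gets its own kernel thread, and all of those
	// threads contend for the single, locked runQueue.
	for i := 0; i < 10000; i++ {
		go longRunningWork(i)
	}
	time.Sleep(time.Second)
}
```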
Limiting the number of threads that can access the runQueue
The problem is these unbounded threads accessing the runQueue. So hey, why don't we just limit the number of threads that can access the runQueue? This scheme is very similar to the one we just discussed, except there is one extra check: before we create a thread, we check whether the number of goroutine-running threads is already at or above the limit.
An important thing to note here is that this limit applies only to threads accessing the runQueue, because the contention problem is what we're trying to solve. That is, it applies to threads running goroutines; it does not apply to threads blocked in system calls. As before, we keep threads around for reuse, and we use the runQueue to give them work.
We set the limit to the number of CPU cores; that way we get all the parallelism we want. Say we limit the number of threads that can be running goroutines at any time to 2. We already have two goroutine-running threads, and now Gmain creates another goroutine. We add G2 to the runQueue and do our checks, but this time there is an extra check, and wait a minute: we already have two goroutine-running threads, so this time we do not create another thread. Well, what happens to G2 then, just sitting in the runQueue? Don't worry: at a future scheduling point, like when G1 blocks on a channel, G2 will be scheduled.
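In the real Go scheduler, this limit is what GOMAXPROCS controls, and it defaults to the number of CPU cores. You can inspect it from any program:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("cores:", runtime.NumCPU())
	// GOMAXPROCS(0) queries the current limit without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```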
Avoid shared runQueue
What happens when the number of cores increases? Say you upgrade your hardware and suddenly have 124 cores. As the number of cores increases, that limit goes up, which means more threads running goroutines and more threads accessing the runQueue, and before you know it, contention again.
Setting the number of goroutine-running threads to the number of CPU cores is still a good idea, though: it means we're maximally leveraging parallelism, and we literally cannot do better because we're limited by the hardware. So we'll hold onto it. The problem, in this case, is the shared runQueue.
So let's use distributed runQueues. The way this scheme works is that on an N-core machine we have N runQueues, one per thread. A thread claims a runQueue to run goroutines, meaning it inserts and removes goroutines from that runQueue only, for the time it is associated with it. As before, we reuse threads.
Use distributed runQueues
In this case we have two runQueues because we're running on 2 cores: runQueue A and runQueue B. The main thread is currently associated with runQueue A. Now the main thread creates a goroutine, G1, which is added to its runQueue. As before, we do our checks: we can have two goroutine-running threads and we have only one, so we start a new thread, T1. T1 claims runQueue B and, wait a minute, its runQueue is empty. There is work in the system, but it's in the other runQueue. So here is the idea, and it's called work stealing: if the local runQueue is empty, the thread picks another runQueue at random and steals half its workload. The nice thing about work stealing is that it organically balances work across threads, because the free threads steal work from the overloaded threads.
OK, let's apply work stealing to solve our problem. We're in a situation where T1 has an empty local runQueue, so it simply steals G1 and then runs it. This scales nicely with the number of CPU cores, because we effectively have per-core runQueues and threads don't contend, since the runQueues are independent. The work across threads is balanced too, thanks to work stealing.
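Here is a sketch of the stealing step, reusing the runQueue type from the earlier sketch. Locking is omitted for brevity, and we skip edge cases like picking yourself as the victim; the real runtime steals between fixed-size per-core queues using atomics:

```go
import "math/rand"

// steal moves half the goroutines from a random victim queue into the
// local queue. A thread calls this when its local runQueue is empty.
func steal(local *runQueue, all []*runQueue) *g {
	victim := all[rand.Intn(len(all))]
	n := len(victim.items) / 2
	if n == 0 {
		return nil // victim has nothing to spare; try another
	}
	local.items = append(local.items, victim.items[:n]...)
	victim.items = victim.items[n:]
	return local.pop() // run one of the stolen goroutines
}
```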
Goroutine starvation
So now we can continue and finally get to run G1. This is the part of the program we care about: G1 creates G3 and then makes that blocking system call. So T1 runs G1, which creates G3, which gets put on T1's runQueue. We don't start a thread this time, because we already have two goroutine-running threads for our 2 cores.
Now G1 makes that blocking system call, and when it blocks in the syscall, T1 blocks right along with it.
The problem: we have a local runQueue with work, but its thread is blocked. All we really need is a mechanism to transfer a blocked thread's runQueue to another thread. That mechanism can simply be a background thread, a monitor, that takes the runQueue and gives it away.
Now, why do we need a background thread?
Why can't the thread itself just give its runQueue away before entering the system call? That would be simpler. The reason is that we don't know ahead of time how long a system call is going to block. We don't want the thread to give away its runQueue, finish the system call really quickly, and then be out of work. So we use this background thread instead.
Who do we give the runQueue to? If we have parked threads, we wake one up and give it the runQueue; if we don't, we start a new thread.
Now, wait a minute, what about our limit on the number of goroutine-running threads? Well, remember, that's exactly what it is: a limit on the number of goroutine-running threads. The original thread is now blocked in a system call, so it has effectively freed up a slot for another goroutine-running thread. This mechanism is called handoff, and it's nice because it prevents goroutine starvation: it means G3 will get to run.
Handoff
OK, T1 is blocked and G3 is sitting in the runQueue, so we hand it off. The background thread does its checks, finds a runQueue with work, sees that its thread is blocked and has been blocked for a little while, so it starts a thread, T2, and then the monitor hands the runQueue to T2, which can then run G3.
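Here is a toy model of handoff, again simulating threads with goroutines; every name in it is invented for the sketch:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// thread is a toy stand-in for a kernel thread with a local runqueue,
// modeled here as a buffered channel of goroutine names.
type thread struct {
	name    string
	blocked atomic.Bool
	runq    chan string
}

func (t *thread) run() {
	for g := range t.runq {
		fmt.Println(t.name, "running", g)
	}
}

func main() {
	t1 := &thread{name: "T1", runq: make(chan string, 8)}
	t1.runq <- "G3"        // G3 is sitting in T1's runqueue...
	t1.blocked.Store(true) // ...but T1 blocks in a syscall first

	// The monitor: it finds a thread that is blocked with pending
	// work and hands its runqueue to a new (or parked) thread.
	go func() {
		time.Sleep(10 * time.Millisecond) // "blocked for a while now"
		if t1.blocked.Load() && len(t1.runq) > 0 {
			t2 := &thread{name: "T2", runq: t1.runq} // the handoff
			go t2.run()
		}
	}()

	time.Sleep(100 * time.Millisecond) // let the demo play out
}
```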
We have finally reached the end of our program and, even better, we have a scheme that seems to work. It scales nicely, we've solved goroutine starvation, and we've solved unbalanced workloads.
Non-cooperative CPU-bound computation
Everything we've talked about so far works as long as the program makes those calls that switch into the scheduler. It happens under the hood, but the program cooperates: it creates goroutines, it blocks on channels. What happens if we have a goroutine that does not cooperate, that does not make any of those calls? Say, a complicated algorithm, a CPU-bound computation that runs for a long, long time and never yields the CPU. If we never call into the scheduler, nothing else gets scheduled, so such a CPU hog can starve the runQueue.
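Here is what such a hog looks like, as a runnable sketch. One note: since Go 1.14 the runtime can also preempt tight loops non-cooperatively (using signals), so on a modern Go this program exits normally; on older versions, with only the cooperative mechanism described below, it would hang:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single goroutine-running thread

	go func() {
		for {
			// Tight CPU-bound loop: no goroutine creation, no channel
			// operations, no function calls, so no scheduling points.
		}
	}()

	time.Sleep(100 * time.Millisecond)
	fmt.Println("main was scheduled again") // requires preempting the hog
}
```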
Cooperative preemption
So what we need is a mechanism to preempt these goroutines when they've been running for too long, a mechanism to force them to yield to the scheduler. The Go scheduler implements a form of preemption called cooperative preemption: there is a background monitor thread called sysmon (which stands for system monitor), and sysmon detects long-running goroutines, ones that have been running for 10 milliseconds or more, and unschedules them when possible. Now, where do we put these preempted goroutines? We don't want to put them back on the same runQueue; that would be unfair to the other goroutines they effectively starved. So where do we put them? The Go scheduler puts them on a global runQueue. That's right: the Go scheduler has a global runQueue in addition to those distributed per-core runQueues. It uses it as a lower-priority queue of sorts: threads check it less frequently than their local runQueues, so contention is not a problem here.
Closing thoughts
OK, no more surprises, I promise. With that, we now have a full understanding of the main ideas, both big and sneaky, behind the Go scheduler. We started out with a list of goals; how did we do? We use a small number of kernel threads, we support high concurrency, and we leverage parallelism, scaling to N cores. All of this falls out of the three ideas we discussed.
Let's move on to the harder questions: what are the limitations of the scheduler?
Well, for one, there is no notion of goroutine priority. Go uses a first-in, first-out runQueue, whereas the Linux scheduler uses a priority queue. That said, the cost-benefit tradeoff of implementing priorities might not actually make sense for Go programs.
The second limitation is that there is no strong preemption, so there are no strong fairness or latency guarantees. It is entirely possible, in certain cases, for a single goroutine to slow the entire system down, or even stall it.
And finally, the third limitation I want to touch upon is that the scheduler is not aware of the actual hardware topology, so there is no real guaranteed locality between the data and the goroutine computation. And with that, we have come to an end. Thank you for reading.
Gopher Artwork credit
Maria Letta
Ashley McNamara
References:
Go's Work Stealing Scheduler (sourcegraph.com)
Go scheduler talk by Kavya Joshi
Scalable Go scheduler talk by Dmitry Vyukov