Discussion on: Using Goroutines is Slower??

Jacob Kim

Are you a wizard? I think I just found one! Thank you for the example code; I'll definitely apply these concepts in my code from now on.

Because I am curious, I have some questions! If you don't mind me asking:

  • I couldn't help but think that using one goroutine per core would be slightly underwhelming. I understand that it's probably going to be really efficient because the scheduler has to do less work, but that runtime.NumCPU() cap seems to... disrespect the whole "lightweight concurrency" thing Go has going for it. I thought the point of it was to be able to use more of it without losing performance?

  • Along the same lines, based on your experience, what would you say is the sweet spot for the number of goroutines? Too few seems to be disrespecting Go, and too many will overwhelm the scheduler with a performance tax.

Again, thank you for the thorough example!

peerreynders • Edited

Are you a wizard?

Far from it. As it is, working through A Tour of Go over three years ago was all I had to go on. And some mad googling. And some 3+ years of using Erlang/Elixir. So you've been warned about my lack of expertise.

think that using one goroutine per core be slightly underwhelming?

The number of cores represents your maximum effective parallelism. While each core can support thousands of threads, all those threads share the core, and switching from one thread to another imposes the processing overhead of a context switch. Threads work because most of the time they are blocked, waiting for something else to happen, so another thread may as well get some work done.

In this case the thread won't be blocked by IO so (short of cache misses) it can just tear through its work—no need to slow it down with context switches.

As it is, runtime.NumCPU() reports 8 cores. But 445229 / 143921 = 3.09356522. The CPU has 4 physical, hyper-threaded cores, so each physical core is reported as 2. For this workload there is no benefit from hyper-threading.
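
For what it's worth, here's a minimal sketch of that worker-per-core idea (the workload and numbers are made up for illustration, not taken from the original benchmark): cap the number of busy goroutines at runtime.NumCPU() and let each one chew through jobs without competing for a core.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// cpuBoundWork stands in for whatever number crunching the original
// benchmark did; the workload here is made up purely for illustration.
func cpuBoundWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

func main() {
	workers := runtime.NumCPU() // logical CPUs, e.g. 8 on a 4-core hyper-threaded machine
	jobs := make(chan int)
	results := make(chan int)

	// One goroutine per logical CPU: each can tear through CPU-bound
	// work without being context-switched against its siblings.
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				results <- cpuBoundWork(n)
			}
		}()
	}

	// Feed the jobs, then close results once every worker has drained.
	go func() {
		for i := 0; i < 1000; i++ {
			jobs <- 1_000_000
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	total := 0
	for r := range results {
		total += r
	}
	fmt.Println("total:", total)
}
```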

"If I have a 10 core computer I just want it to run 10 times faster, if I have a 100 core computer it should run 100 times faster. When we program in Erlang this is approximately true. Our goal is that applications run 0.75 x N times faster on an N-core computer." (From the description of:)

So with a 4 (physical) core CPU a 3x speedup is about as good as you can get.

"disrespect the whole "lightweight concurrency" thing Go has going for it."

The benefit of lightweight concurrency is that having lots of units of concurrency that aren't doing much of anything most of the time is incredibly cheap. Lightweight concurrency only means that the unit of concurrency isn't tied to a specific thread. Naive concurrency maps one unit of concurrency to one thread; all the coordination work is done by the operating system and the computational context of a thread is fairly heavy.

In lightweight concurrency the thread is owned by a scheduler (and typically there is one scheduler per core) and it's the scheduler which incurs the cost of coordinating/scheduling the work of the goroutines it is responsible for. I don't know how it's done in Go but on the Erlang VM a scheduler can "steal" BEAM Processes from overloaded schedulers; so in that case work isn't even bound to a core.

In lightweight concurrency a unit of concurrent work is represented primarily by its current working state. That means there is very little overhead to having lots of units (goroutines/BEAM processes) that don't do anything most of the time (because they are blocked or waiting for something to happen) other than the memory necessary to preserve their current local state—something which can't be said of threads.
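
A rough way to see that cheapness (the goroutine count below is arbitrary): park a large number of goroutines on a channel and they cost little more than the memory for their stacks and local state.

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	const n = 100_000 // arbitrary; each parked goroutine only needs a few KB
	done := make(chan struct{})
	ready := make(chan struct{})

	for i := 0; i < n; i++ {
		go func(id int) {
			ready <- struct{}{}
			<-done // parked here, holding nothing but its local state (id)
			_ = id
		}(i)
	}

	// Wait until every goroutine has started and parked itself.
	for i := 0; i < n; i++ {
		<-ready
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%d goroutines parked, ~%d MB obtained from the OS\n", n, m.Sys/1024/1024)

	close(done) // release them all
}
```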

the point of it was to be able to use more of it without losing performance.

The point is to be able to break work down to such a fine grain that it is possible to always make progress on something while everything else is blocked. But that level of micromanagement incurs a coordination cost that reduces the capacity to perform work. So there is a balance to be struck: break it down far enough that core idle time is minimal, but not so far that coordination overhead eats into your capacity to perform work.
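
One way to picture that balance (the chunk size and data below are arbitrary): instead of spawning one goroutine per element, hand each of the NumCPU workers a contiguous chunk, so the coordination cost is paid a handful of times rather than a million times.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	data := make([]int, 1_000_000)
	for i := range data {
		data[i] = i
	}

	workers := runtime.NumCPU()
	chunk := (len(data) + workers - 1) / workers // one big chunk per worker

	partial := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, (w+1)*chunk
		if hi > len(data) {
			hi = len(data)
		}
		if lo > len(data) {
			lo = len(data)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			sum := 0
			for _, v := range data[lo:hi] {
				sum += v
			}
			partial[w] = sum // each worker writes only its own slot
		}(w, lo, hi)
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	fmt.Println("total:", total)
}
```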

what would you say is the sweet spot for the number of goroutines?

The answer that everybody hates: "it depends". In this case the routines could go independently full bore so it makes little sense to have more than the number of (virtual) cores. In other cases you may have routines that don't do much of anything other than hold their local state but you need to have thousands of them because each has a distinct identity.

Though due to Go's CSP orientation I'd expect that to happen a lot less than in Erlang/Elixir.

Jacob Kim

Wow, I didn't know I'd run into deep insights like this. Trust me, you have MUCH more experience than I do, and I am the student in this relationship xD Thank you so much for all the help!