Sahil Sarwar

Designing Go-routines

Originally published on my Substack blog Brain Bytes and Binary (sahilserver.substack.com). I deep-dive into Go internals, backend systems, and more.

To understand the architecture that shapes goroutines, we first need to understand a much-discussed but often misunderstood topic: Concurrency vs. Parallelism.

You must have seen the image below hundreds of times.

Concurrency vs Parallelism

Concurrency and parallelism are often confused, but they are fundamentally different concepts. While parallelism refers to executing multiple tasks simultaneously (limited by the number of CPU cores), concurrency is about structuring a program to handle multiple tasks efficiently, even if they are not running simultaneously.

So, what does this mean?

We can only run a certain number of processes simultaneously (equal to the number of CPU cores).
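
To see this limit on your own machine, you can query the Go runtime directly; a minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Number of logical CPU cores available to this process.
	fmt.Println("CPU cores:", runtime.NumCPU())

	// GOMAXPROCS caps how many goroutines can execute in parallel.
	// Calling it with 0 reports the current value without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```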

Designing an Efficient Concurrency Model

For an effective concurrency model, we need:

  • The ability to handle a large number of concurrent tasks.
  • A lightweight threading mechanism (more efficient than OS threads).

This leads us to the following hierarchy:

  • A CPU core can have multiple OS threads.
  • A thread can manage numerous goroutines.

However, only a limited number of goroutines can run in parallel (equal to the number of CPU cores).

So, how does the hierarchy look in our concurrency model now?

GMP Architecture in Golang

To reiterate what the example above shows: we can only run 2 goroutines truly in parallel (since we only have 2 cores).

How do we make sure that 1 thread can handle multiple goroutines concurrently?

We can do that with some kind of scheduler that schedules the goroutines, maintains their state, and checks whether each one needs CPU/memory or should be terminated.

The Go Scheduler and G-M-P Model

The goroutine scheduler is built around the following G-M-P model:

Goroutines (G)

A Goroutine represents a lightweight user-space thread.

It contains its registers, a small stack (starting at 2KB), and a pointer to the M (OS thread) it runs on.

The runtime schedules Gs dynamically on available Ps.

Machine (M)

Represents an OS thread that executes Goroutines.

An M can block (on a system call, file I/O, or a network call), and the runtime may create a new M if needed.

An M can execute only one Goroutine at a time but can switch between them.

Processor (P)

Represents a logical CPU and is responsible for scheduling Goroutines.

The number of Ps is equal to GOMAXPROCS, which determines the maximum number of Goroutines running in parallel.

Each P has its local run queue (holding Goroutines ready to execute).

This way, we can make sure the scheduler efficiently maps thousands of goroutines onto a limited number of OS threads. This is how Go achieves high performance without the overhead of traditional threading models.
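
As a rough illustration of how cheap this makes concurrency, the sketch below launches 10,000 goroutines; the scheduler multiplexes all of them onto a handful of OS threads:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 10000
	var wg sync.WaitGroup
	wg.Add(n)

	// All 10,000 goroutines are multiplexed by the scheduler onto a
	// small number of OS threads; at most GOMAXPROCS run in parallel.
	for i := 0; i < n; i++ {
		go func(id int) {
			defer wg.Done()
			_ = id * id // stand-in for real work
		}(i)
	}

	wg.Wait()
	fmt.Println("all", n, "goroutines finished")
}
```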

Let’s look at how Go makes sure it utilizes the CPU to its maximum.

Efficient CPU Utilization

Let’s say we have 2 Processors:

P1 - running 30 goroutines

P2 - running 50 goroutines

It’s natural to assume that P1 might complete all its goroutines quicker than P2. What then?

Once a P has exhausted its local run queue, it can steal goroutines from other Ps and run them instead, making sure we keep contention low and CPU utilization high.

We can also keep a global queue (where goroutines waiting for execution are stored) from which the Ps can pick goroutines once their local run queues are empty.

Global and local queues for goroutines
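
To make the stealing idea concrete, here is a toy sketch. The G and P types and the "steal half" policy below are illustrative stand-ins; the real logic lives inside the runtime scheduler, not in user code:

```go
package main

import "fmt"

// G and P are made-up stand-ins for the runtime's internal structures.
type G struct{ id int }

type P struct {
	id   int
	runq []*G // local run queue
}

// popLocal takes the next goroutine from the P's own queue.
func (p *P) popLocal() *G {
	if len(p.runq) == 0 {
		return nil
	}
	g := p.runq[0]
	p.runq = p.runq[1:]
	return g
}

// steal takes half of a victim P's queue, mirroring the real
// scheduler's "steal half" policy.
func (p *P) steal(victim *P) *G {
	n := len(victim.runq) / 2
	if n == 0 {
		return nil
	}
	stolen := victim.runq[:n]
	victim.runq = victim.runq[n:]
	p.runq = append(p.runq, stolen[1:]...)
	return stolen[0]
}

func main() {
	p1 := &P{id: 1}                                     // empty: finished its goroutines
	p2 := &P{id: 2, runq: []*G{{31}, {32}, {33}, {34}}} // still busy

	if g := p1.popLocal(); g == nil {
		// Nothing local, so steal from P2.
		g = p1.steal(p2)
		fmt.Printf("P%d stole G%d from P%d (plus %d more queued)\n",
			p1.id, g.id, p2.id, len(p1.runq))
	}
}
```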

We know goroutines are lightweight, but why is that so? And what makes them fast?

Let’s look at memory footprints for goroutines.

Memory Management in Goroutines

Unlike OS threads that start with a fixed-size stack (around 2MB), Goroutines start with a small, dynamically growing stack (around 2KB).

How Stack Growth Works

When a Goroutine runs out of stack space, the runtime grows its stack.

Instead of allocating a large contiguous stack up front (like OS threads), Go grows the stack on demand, a process often called stack splitting (modern Go implements it by copying the stack to a bigger block):

  • A new, larger segment of memory is allocated.
  • The previous stack’s contents are copied to the new segment.
  • The old stack is freed and can be garbage-collected.

Advantages of Stack Splitting

Memory Efficiency – It avoids allocating large, unused memory upfront.

Less Fragmentation – Since stacks grow dynamically in small steps, Go avoids the fragmentation issues seen in traditional threading models.

Faster Allocation – Since Go doesn’t need to reserve a large stack initially, Goroutines can be created faster.
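
You can observe the effect indirectly; the sketch below recurses deeply enough to need megabytes of stack, and the runtime grows the goroutine's stack from its initial ~2KB transparently:

```go
package main

import "fmt"

// deep recurses to the given depth; each frame keeps a local buffer
// alive, so the goroutine's stack must grow far beyond its initial ~2KB.
func deep(n int) int {
	var buf [128]byte // per-frame stack allocation
	buf[0] = byte(n)
	if n == 0 {
		return int(buf[0])
	}
	return deep(n-1) + int(buf[0])
}

func main() {
	done := make(chan int)
	go func() {
		// ~50,000 frames of 128+ bytes each: the runtime transparently
		// copies the stack to larger blocks as the recursion deepens.
		done <- deep(50000)
	}()
	fmt.Println("result:", <-done)
}
```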

Now our goroutines are created. But how can we be sure they run at their full potential?

Let’s look at how the scheduler ensures fairness when running the goroutines.

Scheduling: Preemption & Fairness

Preemption is the ability of the scheduler to interrupt a running Goroutine and switch to another one to ensure fairness.

Preemptive Scheduling

Preemptive scheduling is used to interrupt long-running Goroutines automatically.

The compiler inserts safe points into functions where preemption can occur.

Internally, Go uses a round-robin scheduling approach within each P, along with preemption.

How Preemption Works in Go

When a Goroutine runs for too long without exiting, the scheduler sets a preemption flag on it.

The Goroutine is interrupted at a safe point (normally at the beginning of a function).

The runtime scheduler then moves it to the global run queue, allowing other Goroutines to run.

How Safe Points Are Determined

The compiler inserts preemption checks in function prologues (a prologue is the code that runs at the beginning of a function; its responsibility is to set up the stack frame of the called function).

These checks are low-cost and allow the runtime to preempt a Goroutine when needed.

However, certain code (such as long loops without function calls) may contain no safe points, making it harder to preempt. Since Go 1.14, the runtime handles this case with asynchronous preemption, interrupting such loops via OS signals.
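
If you do not want to rely on preemption at all, a goroutine can also yield the processor voluntarily with runtime.Gosched(); a minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	done := make(chan struct{})

	go func() {
		for i := 0; i < 3; i++ {
			fmt.Println("worker:", i)
			// Explicitly yield so other runnable goroutines get a
			// turn, instead of waiting to be preempted.
			runtime.Gosched()
		}
		close(done)
	}()

	for i := 0; i < 3; i++ {
		fmt.Println("main:", i)
		runtime.Gosched()
	}
	<-done
}
```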

Blocking Tasks in Goroutines

When a Goroutine encounters blocking tasks (e.g., I/O, network calls), the scheduler must track its state efficiently.

How Go Handles Blocking Tasks

The scheduler not only swaps goroutines between multiple queues, it also maintains their state, just as the OS maintains the state of processes when switching between them.

A Goroutine stops running when it:

  • Completes execution.
  • Is preempted by the scheduler.
  • Blocks on I/O, locks, or network calls.

Once the blocking operation is done, the runtime marks the goroutine as “READY” and places it back on a run queue for a P (processor) to pick up.

This raised a question in my head.

If a goroutine is making an HTTP request, this is a blocking call, meaning the goroutine gets swapped for another.

Then how does the scheduler know that the HTTP request has completed and the data is available, so that it can resume the goroutine?

I/O Polling and Netpoller

Go uses OS-level pollers (e.g., epoll on Linux, kqueue on macOS) to detect when data is available.

Instead of checking manually, Go relies on event-driven notifications from the OS.

When data is ready, the netpoller wakes up the Goroutine and schedules it for execution.

Go uses I/O multiplexing under the hood, which allows a single M (OS thread) to handle many I/O-bound operations concurrently without blocking. The Go runtime uses mechanisms like epoll (Linux) and kqueue (macOS) to monitor multiple sockets or file descriptors simultaneously.
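
Here is a small sketch of what this means in practice (the URLs are placeholders): several goroutines block on HTTP calls at once, yet the runtime does not need one OS thread per request, because blocked goroutines are parked on the netpoller until their data arrives.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Placeholder URLs for illustration.
	urls := []string{
		"https://example.com/",
		"https://example.org/",
		"https://example.net/",
	}

	var wg sync.WaitGroup
	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			// While this call waits on the network, the goroutine is
			// parked on the netpoller and its M is free to run other
			// goroutines; the runtime wakes it when the response arrives.
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println(u, "error:", err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, "->", resp.Status)
		}(url)
	}
	wg.Wait()
}
```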

Now that we understand how goroutines work and how they are handled, let’s see what happens if we don’t create good goroutines.

Garbage Collection & Goroutines

The Go runtime has an automatic garbage collector (GC) that reclaims memory allocated to unused objects. Goroutines themselves, however, are not garbage-collected: a Goroutine blocked forever (for example, on a channel that nobody will ever send to) leaks, along with everything its stack references, for the lifetime of the program. The runtime tracks every Goroutine, so it is up to us to make sure each one can eventually terminate.

Performance Considerations

Avoid Goroutine Leaks – Always close channels or use context.WithCancel() to prevent orphaned Goroutines.

Optimize Channel Usage – Use buffered channels when possible to reduce blocking.

Minimize Lock Contention – Use sync.Mutex carefully to avoid performance bottlenecks.
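
A minimal sketch of the context.WithCancel() pattern from the first tip; the worker exits when cancelled instead of leaking:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// worker exits cleanly when the context is cancelled instead of
// blocking on the channel forever and leaking.
func worker(ctx context.Context, jobs <-chan int) {
	for {
		select {
		case <-ctx.Done():
			fmt.Println("worker: cancelled, exiting")
			return
		case j := <-jobs:
			fmt.Println("worker: got job", j)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	jobs := make(chan int)

	go worker(ctx, jobs)

	jobs <- 1
	cancel() // signal the worker to stop; without this it would leak

	// Give the worker a moment to observe cancellation (demo only).
	time.Sleep(100 * time.Millisecond)
}
```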


Key Takeaways

  • Go uses a G-M-P model to efficiently schedule Goroutines.
  • Work stealing & global queue help distribute tasks among processors.
  • Stack splitting ensures memory efficiency and quick Goroutine allocation.
  • Preemptive scheduling prevents long-running Goroutines from starving others.
  • I/O multiplexing enables efficient handling of network calls and system I/O.
  • Proper Goroutine management is necessary to avoid memory leaks.

Many of us agree that Golang’s goroutine-based concurrency makes it really simple to write concurrent code, but we rarely realize how complex the internal implementation is that makes it look so simple.

I was fascinated when researching this topic and its intricate details.

That’s all for this week. I enjoyed writing this one, and I will be back next week with another interesting topic.
