If you’ve ever tried learning goroutines, you’ve probably come across the line “They’re lightweight threads”. But then the questions start coming in:
- “Are they real threads ?”
- “How can go run millions of them ?”
- “What is this GMP thingy ?”
I had the same questions and a lot of confusion when I started learning Go. Goroutines seemed magical - almost too good to be true. But the more research I did, the more things started to make sense.
So I went deeper down the rabbit hole to understand how goroutines work alongside OS threads. This article is that journey and my attempt to explain how everything fits together.
Introduction
Let’s first get our basics clear. We need to know about “Processes”, “Threads”, “Context-Switching” and a few other basic concepts like stack size ( although stack size is not that important for the context of this article ).
Process
A process is an independent execution environment created by the operating system.
It consists of:
- A private virtual address space (VAS) mapped by the OS and MMU ( Memory Management Unit )
- Executable code (the program’s text segment)
- A heap for dynamic memory allocation
- One or more stacks, one per thread
- File descriptors referencing kernel-managed resources (files, sockets)
- Environment variables inherited from its parent
- Process control metadata, including PID, scheduling priority, credentials, resource limits, and runtime statistics
Processes provide isolation: each process runs with its own memory mappings and cannot directly access the memory of other processes.
Context switching between processes requires switching address spaces, memory mappings, and other resource structures, which makes process switches relatively expensive.
In simple terms, a process is just a running program with its own isolated memory and system resources. Two processes never interfere with each other’s execution and stay isolated from one another.
Threads
Now that we understand what a process is, the next piece is understanding threads, because goroutines are built on this concept.
A single process may contain one or many threads, all sharing the same:
- virtual address space
- heap
- global variables
- open file descriptors
- code and libraries
However, each thread has:
- its own stack
- its own program counter (PC)
- its own CPU register set
- its own thread-local storage (TLS)
- its own kernel scheduling metadata
Threads within the same process run independently and may execute in parallel on different CPU cores, or concurrently on the same core through context switching ( we’ll come to this in the next section ).
Because threads share memory, they can communicate cheaply — but must also use synchronization mechanisms (mutexes, semaphores) to avoid race conditions.
A race condition is a situation where two or more threads access the same shared data at the same time and at least one of them is modifying it. The final state of the data depends on the timing of those threads’ execution. Since the timing is unpredictable, the results become inconsistent every time you run the program, which makes the code difficult to debug. That’s why mutexes and semaphores are important for avoiding race conditions: they lock the shared data once a thread has access to it, so other threads cannot touch that data until the first thread is finished and the lock is released.
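Goroutines only come later in this article, but since this is a Go blog, here is a minimal Go sketch of the same idea using a mutex ( the counter and the number of goroutines are arbitrary choices; the identical pattern applies to OS threads in any language ):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var (
		mu      sync.Mutex
		counter int
		wg      sync.WaitGroup
	)

	// 1000 concurrent increments of the same shared variable.
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()   // only one goroutine may touch counter at a time
			counter++   // read-modify-write of shared data
			mu.Unlock() // release the lock so others can proceed
		}()
	}
	wg.Wait()

	// With the mutex this always prints 1000.
	fmt.Println("final counter:", counter)
}
```

Remove the Lock/Unlock pair and the final value will usually be less than 1000 and different on every run: exactly the race condition described above.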
In summary, threads are the kernel’s basic unit of CPU execution inside a process. Or in simpler words, a thread is the most basic sequence of instructions inside a process, and each CPU core runs one thread at a time. If you have an octa-core processor ( meaning 8 cores ), then 8 threads can run simultaneously at any given moment, achieving true parallelism.
Even though switching between threads is cheaper than switching between processes, it’s still not free. Thread context switches involve several steps:
1. Saving/loading CPU registers: Each thread has its own execution state — registers like the program counter, stack pointer, general-purpose registers, etc.
When switching threads, the OS must save the registers of the outgoing thread and restore the registers of the incoming thread, which takes time.
2. Switching stacks: Every thread has its own stack.
A context switch requires switching the stack pointer from one thread’s stack to another’s.
This means the CPU must now begin reading/writing function frames from a completely different memory region. These stacks can range from 1 MB to 8 MB in size, which adds to the cost of a context switch.
3. Possible TLB and cache effects: TLB = Translation Lookaside Buffer, a small high-speed cache that stores recently used virtual-to-physical memory address translations.
Switching threads can cause TLB misses and cache invalidations, which forces the CPU to reload memory mappings or fetch data from slower memory levels, reducing performance.
4. Involvement of the OS scheduler: Thread switching requires a trap into the operating system (kernel mode).
The OS scheduler must:
- decide which thread runs next
- update scheduling metadata
- manage states like runnable, waiting, or blocked
This kernel-mode transition adds a lot of overhead.
Context Switching
A context switch is the act of the operating system pausing one running thread or process and resuming another.
Because the CPU can run only one thread per core at a time, the OS must rapidly switch between multiple threads to provide concurrency.
When the OS switches from Thread A → Thread B, it must:
- Save the CPU registers (program counter, stack pointer, general-purpose registers, flags) of Thread A
- Save Thread A’s kernel metadata (scheduling state, priority, CPU usage stats)
- Load Thread B’s saved register state
- Switch to Thread B’s stack
- Possibly switch address spaces (if switching between processes)
- Update scheduling queues and bookkeeping (updating all the small pieces of internal data the OS keeps to track the state of each thread or process)
This entire procedure is performed by the OS scheduler and requires switching from user mode to kernel mode, running scheduling logic, then returning back to user mode.
A context switch ensures that each runnable thread gets a fair share of CPU time but comes with performance costs due to register saving, stack switching, TLB/cache effects, and kernel involvement. Following are the most common reasons why context switching is expensive:
- Saving & loading registers: The CPU must store all registers of the old thread and restore the registers of the new one — its entire execution state.
- Switching stacks: Each thread has its own stack. The CPU has to stop using one thread’s stack and start using another’s.
- Kernel involvement: Switching threads requires entering the kernel, updating run queues and priorities, then returning to user mode.
- TLB (Translation Lookaside Buffer) effects: Switching between processes requires switching the page table and flushing part of the TLB, slowing down memory access.
- Cache disruption: Each thread often works on different memory regions. Switching threads may cause cache misses because the CPU has to load new data from memory.
All this makes context switching far from free, even though modern CPUs and OSes optimize it heavily. This is where goroutines shine !!!
What are Goroutines and how are they different from OS threads ?
Concurrency is one of Go’s biggest strengths, and goroutines are at the center of it. But in order to appreciate why goroutines are special, we need to understand what they are and how they differ from OS threads.
What are Goroutines ?
A goroutine is a lightweight function that runs independently and concurrently within a Go program. A more technical definition would be: a goroutine is a user-space execution unit ( or user-space thread ) managed entirely by the Go runtime. It has:
- Its own stack, starting with a very small size ( ~2KB) compared to OS threads
- The ability to grow or shrink its stack size on demand
- Scheduling performed by the Go runtime scheduler, not the OS scheduler
- Extremely low creation and context-switching cost
Goroutines are multiplexed onto OS threads using an M:N mapping:
M goroutines are mapped to N OS threads. This is why goroutines can scale to hundreds of thousands or even millions in a single process.
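To get a feel for how cheap they are, here’s a small sketch that spawns 100,000 goroutines and waits for them all to finish ( the exact timing varies by machine, but it typically completes in a fraction of a second ):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const n = 100000
	var wg sync.WaitGroup
	start := time.Now()

	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			// No real work here; the point is the cost of creating
			// and scheduling the goroutine itself.
		}()
	}
	wg.Wait()

	fmt.Printf("spawned and finished %d goroutines in %v\n", n, time.Since(start))
}
```

Try the same thing with 100,000 OS threads in another language and you’ll quickly run into memory and scheduling limits.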
How are Goroutines different from OS Threads ?
Although they both represent a unit of execution ( a sequence of instructions ), goroutines and OS threads differ in almost every important way.
- Scheduling
- Threads: scheduled by OS kernel
- Goroutines: scheduled by the Go runtime ( this is why goroutine switching is way cheaper, since it never traps into the kernel )
- Stack size
- OS threads: large, fixed-size stacks ( 1MB-8MB each )
- Goroutines: tiny, growable stacks ( start at ~2KB and grow or shrink as needed. This is one of the main reasons why Go can handle such massive concurrency )
- Creation Cost:
- Thread creation is expensive since kernel and memory allocation is involved.
- Goroutine creation is extremely cheap ( just user space allocation )
- Context switching cost
- Thread switching involves saving registers, switching stacks, kernel mode transitions, scheduler logic, and cache/TLB effects.
- Goroutine switching is done inside the Go runtime and requires only saving a small amount of state. Since the kernel is not involved, switching is very fast.
- Memory & Resource Usage
- A thread consumes megabytes of memory.
- A goroutine consumes kilobytes. This allows Go programs to use thousands of goroutines safely.
- Blocking behaviour:
- A blocking syscall blocks an entire OS thread
- A blocking I/O operation in Go parks the goroutine, not the OS thread. The runtime efficiently assigns another goroutine to the freed OS thread. This way OS threads are never left sitting idle, giving higher performance.
A goroutine is a lightweight, user-space thread scheduled by the Go runtime, while an OS thread is a heavyweight execution unit scheduled by the operating system.
Goroutines are cheaper, faster, and far more scalable than OS threads.
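You can roughly observe this from inside a program. The sketch below parks a large number of goroutines on a channel and then compares the goroutine count with the number of OS threads the runtime has created ( the threadcreate profile count is an approximation and includes threads the runtime creates for itself, but it stays tiny compared to the goroutine count ):

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	block := make(chan struct{}) // never closed, so every receiver stays parked

	for i := 0; i < 10000; i++ {
		go func() {
			<-block // Go-managed blocking: the goroutine is parked, no OS thread is tied up
		}()
	}

	time.Sleep(100 * time.Millisecond) // give the goroutines a moment to start and park

	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
}
```

Ten thousand parked goroutines, yet only a handful of OS threads ever get created.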
What is Go runtime ?
By now we’ve talked about processes, threads, and goroutines — but there’s a crucial piece sitting between goroutines and the operating system: the Go runtime.
This runtime is what actually makes goroutines possible.
The Go runtime is like a mini operating system that runs inside your Go program. More formally, it is a user-space runtime system bundled with every Go program.
It takes care of everything the OS doesn’t handle for you, including:
- running and scheduling goroutines ( GMP model )
- managing memory and garbage collection
- handling timers
- dealing with network and system calls
- growing and shrinking goroutine stacks
- waking goroutines when events happen
When you run a Go program, you’re not just running your code — you’re also running this runtime, which works in the background and keeps the whole concurrency system running smoothly.
The key thing to understand:
Goroutines don’t run on the OS.
They run on top of the Go runtime, and the runtime runs on OS threads.
This is what makes goroutines so efficient.
You can think of the Go runtime as a middle layer between goroutines and OS threads.
GMP Model, M:N mapping ?
Goroutines don’t run directly on CPU cores, and they aren’t scheduled by the operating system.
Instead, Go uses a custom, high-performance scheduler called the GMP model, which is at the heart of Go’s concurrency design.
Understanding GMP is crucial because it explains how thousands of goroutines can be multiplexed onto a small number of OS threads efficiently.
What is the GMP model ?
G stands for goroutine ( obviously ). In other words, a lightweight, user-space execution unit.
It contains its own tiny stack, its program counter and the rest of its metadata. Note that a G cannot run on a CPU by itself.
M stands for Machine ( OS thread ). Basically, it is a real operating system thread, the thing the OS scheduler actually places on a CPU core. One M can run only one G at a time, and switching between G’s happens on the M itself. If the M gets context-switched out by the OS, the G it was running simply waits for the time being; once that M is scheduled back onto a CPU, the saved state is restored and execution continues.
P stands for Processor. Note that I am not talking about actual CPU cores here. A Processor represents a logical scheduling token which holds a run queue of goroutines. Only an M holding a P is allowed to execute Go code.
P1 → M1 → G1, G2, G3...
P2 → M2 → G4, G5...
P3 → M3 → G6...
...
If, say, G1 is blocked due to some I/O ( an API call or a file read ), then M1 will park that goroutine ( it’s actually the Go runtime which parks G1 by marking it as waiting; since the runtime code runs on M1, we just say that M1 parks G1 ) and pick the next available G in P1’s queue. This way the OS thread stays busy, avoiding wasted CPU time.
Also note that each P’s run queue is managed by the Go runtime.
M:N mapping ?
Let’s say, for example, you have 100,000 goroutines and 8 CPU cores ( an octa-core processor ). The runtime might create around 10-20 OS threads (M’s).
These M’s then get scheduled onto the 8 cores by the OS. Each M runs one goroutine at a time, and each P has a queue of goroutines waiting to run. When one goroutine blocks ( e.g. on a channel or a syscall ), the M picks another goroutine from its P’s run queue.
So M:N basically means that M goroutines are mapped onto N OS threads. ( Please note that here M represents a NUMBER of goroutines, not the M (OS thread) of the GMP model. I know… the letter convention sucks ).
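You can peek at these numbers from inside a program. A minimal sketch ( the exact values will differ per machine ):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	fmt.Println("CPU cores:", runtime.NumCPU())
	fmt.Println("P's (GOMAXPROCS):", runtime.GOMAXPROCS(0)) // passing 0 just queries the current value
	fmt.Println("G's currently alive:", runtime.NumGoroutine())
	// There is no direct API for the number of M's (OS threads);
	// the runtime creates, parks, and reuses them as needed.
}
```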
A simple example: How the go runtime schedules goroutines
This example will touch the surface level of scheduling. I won’t be discussing any scheduling algorithms here.
Let’s say we have one OS thread (M), one processor (P), and three goroutines (G1, G2, G3). When these goroutines are started, they are placed in P’s local run queue.
Step 1: M begins running G1
The OS scheduler picks M1 (an OS thread) to run on a CPU core.
M1 owns P1, which contains the run queue:
[G1, G2, G3].
M1 pops G1 from the front of the queue and begins executing it.
Step 2: G1 hits a blocking point
Let’s say G1 makes a channel receive or waits on a network read.
Because this is Go-managed blocking, the runtime notices that G1 cannot continue.
- The Go runtime parks G1
- It moves G1 to the appropriate wait list (e.g., waiting on a channel or the network poller)
G1 is now blocked, but importantly:
M1 is NOT blocked. The OS thread stays free.
Step 3: M1 picks the next runnable goroutine
Since G1 is parked, M1 looks at P1’s run queue.
Remaining goroutines:
[G2, G3]
M1 selects G2 and begins executing it.
This is a goroutine context switch, done entirely in user space by the runtime (no OS involvement).
It switches:
- G1’s PC and stack pointer are saved
- G2’s PC and stack pointer are restored
This is extremely fast.
Step 4: OS preempts the thread (OS-level context switch)
While G2 is running, the OS timer interrupt fires.
The OS scheduler says:
“Time slice over. Let’s run another OS thread.”
- M1 is paused
- The OS loads another OS thread, say M7, onto the CPU
- M7 may belong to a totally different program
This is an OS context switch — heavier and more costly.
Inside the paused M1:
- G2 is still waiting to resume
- P1 is still attached to M1
When the OS eventually puts M1 back onto the CPU, Go continues running G2 from exactly where it left off.
Step 5: G1 becomes unblocked
Suppose the network poller signals that data arrived.
- The runtime marks G1 as runnable again
- G1 is placed back into P1’s run queue
Queue becomes:
[G3, G1]
Step 6: M1 finishes G2 and picks the next goroutine
When G2 yields or completes, M1 picks the next goroutine in P1’s queue.
Next is G3.
After G3 yields, M1 picks G1, which is runnable again.
Putting it All Together
Here’s what happened:
**The Go runtime context switched between goroutines (G1 → G2 → G3 → G1)**
- Done entirely in user space
- Very cheap
- No OS involvement
**The OS scheduler context switched M1 off the CPU**
- OS-level
- Expensive
- Paused whatever goroutine M1 was running
**P’s run queue kept track of which goroutines were runnable**
- Managed by the Go runtime
- M1 always pulled new work from P1
**Blocked goroutines didn’t block the OS thread**
- Thanks to goroutine parking
- OS thread stayed productive, always busy
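The whole walkthrough can be compressed into a tiny runnable sketch. The names G1/G2/G3 below are just labels and the exact interleaving is up to the scheduler, but with GOMAXPROCS(1) there is a single P, so when “G1” parks on a channel receive the runtime simply runs the next runnable goroutine on the same OS thread:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	runtime.GOMAXPROCS(1) // a single P: all goroutines share one run queue

	var wg sync.WaitGroup
	data := make(chan string)
	wg.Add(3)

	// "G1": blocks on a channel receive, so the runtime parks it.
	go func() {
		defer wg.Done()
		msg := <-data // parked here until something is sent
		fmt.Println("G1 resumed with:", msg)
	}()

	// "G2": gets to run while G1 is parked; the OS thread never sits idle.
	go func() {
		defer wg.Done()
		fmt.Println("G2 running while G1 is parked")
	}()

	// "G3": simulates the awaited event and unblocks G1.
	go func() {
		defer wg.Done()
		time.Sleep(10 * time.Millisecond)
		data <- "network data"
	}()

	wg.Wait()
}
```

Typically you’ll see G2’s message first and G1’s resume message afterwards, all on one OS thread, with every switch handled by the Go runtime in user space.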
Why doesn’t spawning more goroutines help with CPU-bound tasks ?
Goroutines shine when tasks involve waiting (network I/O, disk I/O, timers, channels, etc.), because they allow other goroutines to run while one is blocked.
But in CPU-bound tasks—like computing primes, hashing, compression, physics simulation, image processing—goroutines don’t help beyond a certain point.
Let’s break down the reasons:
- There are only N CPU cores, so only N goroutines can run at the same time. If your machine has 8 CPU cores, only 8 threads/G’s can run simultaneously—no matter how many goroutines you spawn. Everything else just waits in queues. So if your CPU can do 8 things at a time, spawning 100,000 CPU-bound goroutines won’t make it faster. It will only add overhead.
- Extra goroutines increase scheduler overhead. Every extra runnable CPU-bound goroutine: must be queued, must be picked up by the scheduler, must eventually run, must involve G→G context switching. When the tasks are CPU-bound, each goroutine never blocks, so the scheduler has fewer opportunities to efficiently switch them. Too many unblocked goroutines = too many unnecessary context switches. This overhead can reduce total throughput.
You don’t need more than GOMAXPROCS goroutines for parallel CPU work. If GOMAXPROCS = 8, spawning 8 CPU-bound goroutines achieves maximum parallelism.
Anything more:
- doesn’t increase speed
- increases scheduling overhead
- increases memory usage
- causes more context switching
- lowers cache locality
So ideal number of goroutines for CPU-bound tasks ≈ number of CPU cores.
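A common pattern that follows from this, sketched below, is to cap the number of worker goroutines at GOMAXPROCS and feed them work through a channel ( the prime-counting helper is just a placeholder for any CPU-bound task ):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// isPrime is a deliberately naive, CPU-bound check used only for illustration.
func isPrime(n int) bool {
	if n < 2 {
		return false
	}
	for i := 2; i*i <= n; i++ {
		if n%i == 0 {
			return false
		}
	}
	return true
}

func main() {
	workers := runtime.GOMAXPROCS(0) // one CPU-bound worker per logical processor
	jobs := make(chan int, 1024)

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			found := 0
			for n := range jobs {
				if isPrime(n) {
					found++
				}
			}
			mu.Lock()
			total += found
			mu.Unlock()
		}()
	}

	for n := 2; n < 100000; n++ {
		jobs <- n
	}
	close(jobs)
	wg.Wait()

	fmt.Printf("found %d primes below 100000 using %d workers\n", total, workers)
}
```

Spawning 100,000 goroutines here instead of `workers` would not finish any faster; it would only add the scheduling and memory overhead described above.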
Conclusion
Goroutines aren’t just lightweight threads — they’re part of a carefully designed runtime system that makes Go highly scalable. By understanding processes, threads, context switching, and the GMP model, it becomes clear why Go doesn’t rely on OS threads alone.
Goroutines work so well because they use tiny, growable stacks, fast user-space scheduling, and an efficient M:N mapping to OS threads. This lets Go run thousands or even millions of concurrent tasks without the cost of creating thousands of OS threads.
In the end, the message is simple:
OS threads handle parallelism; goroutines enable massive concurrency.
Together, they give Go its power, performance, and simplicity.
This brings us to the end of the blog. I’ve tried my best to explain everything I know in simple terms. AI helped a lot in shaping this article since I’m still improving my writing skills.
Your feedback would mean a lot — it helps me learn and write better.