I build backend systems for a living. Most of what I ship runs on NestJS — TypeScript, async/await, event-driven patterns, the usual. But over the last year, Go has quietly become my second language, and not by accident. I reached for it specifically because of how it handles concurrency.
This post is about that decision, and more importantly — about what's actually happening under the hood when you write go func(). Because most articles stop at "goroutines are lightweight threads," and that explanation is simultaneously true and almost useless for building mental models that actually help you write better code.
Let's go deeper.
Why Concurrency Matters (and Why Most Developers Underestimate It)
Here's a question worth sitting with: when your server handles 1,000 simultaneous HTTP requests, what is actually happening on the machine?
If you're coming from Node.js, your instinct might be "the event loop handles it — one thread, non-blocking I/O." If you're coming from Java or C#, you might think "a thread pool." If you're coming from Python, you might be somewhere between "GIL hell" and "asyncio."
All of these are concurrency models. They have different tradeoffs, different failure modes, and different mental overhead. Go's model is arguably the most honest about what's actually happening — and I mean that as a genuine compliment.
But before we get into Go, we need to agree on terminology, because this field is plagued by conflation.
Concurrency vs. Parallelism — The Distinction That Actually Matters
Rob Pike (one of Go's creators) gave a talk in 2012 that I consider required viewing for any backend developer: "Concurrency is not parallelism."
The core idea:
- Concurrency is about structure. It's the composition of independently executing computations. It's a design property.
- Parallelism is about execution. It's the simultaneous execution of multiple computations. It's a runtime property.
A concurrent program can run in parallel, but doesn't have to. And a parallel program isn't automatically well-structured: parallelism says nothing about the design.
Practical example: imagine you're cooking dinner. You put pasta water on to boil, then chop vegetables while you wait, then check the water, then continue chopping. You're doing one thing at a time — but you've structured the work concurrently. If you had two pairs of hands doing both simultaneously, that would be parallel.
Go lets you express concurrency clearly. Your OS + Go runtime decides whether to execute it in parallel.
Why I Use Go (and Where NestJS Falls Short)
I want to be honest here: NestJS is excellent. TypeScript's type system is genuinely great. The ecosystem is mature. For most CRUD-heavy B2B SaaS work, it's the right tool.
But there are specific scenarios where I've felt the friction:
1. Long-running background processing
In my billing platform (Billy), I have a webhook inbox processor — raw Stripe events land in a table, a scheduled job picks them up, groups them by webhookId, and processes them sequentially per group. In NestJS I'm using BullMQ for this, which means Redis as a dependency, queue configuration, worker threads, retry logic wired manually.
In Go, a similar pattern can be expressed with goroutines and channels in ~50 lines, with zero external dependencies. Not always the right tradeoff — but the simplicity gap is real.
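To make that concrete, here's a sketch of the shape I mean. None of this is Billy's actual code; the Event type, handle function, and shard count are stand-ins. Events are sharded by webhookId, so each group processes sequentially while groups run concurrently:

package inbox

import (
    "hash/fnv"
    "sync"
)

type Event struct {
    WebhookID string
    Payload   []byte
}

func handle(e Event) { /* process one event */ }

func processInbox(batch []Event, numWorkers int) {
    var wg sync.WaitGroup
    shards := make([]chan Event, numWorkers)
    for i := range shards {
        shards[i] = make(chan Event, 16)
        wg.Add(1)
        go func(ch <-chan Event) {
            defer wg.Done()
            for e := range ch { // sequential within a shard
                handle(e)
            }
        }(shards[i])
    }
    for _, e := range batch {
        h := fnv.New32a()
        h.Write([]byte(e.WebhookID))
        shards[h.Sum32()%uint32(numWorkers)] <- e // same webhookId, same shard
    }
    for _, ch := range shards {
        close(ch)
    }
    wg.Wait()
}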
2. CPU-bound work
Node.js has a single-threaded event loop. If you're doing heavy computation — parsing, hashing, data transformation — you block the loop, or you ship it to a worker thread with all the overhead and complexity that entails. Go goroutines can run on multiple OS threads simultaneously. CPU-bound work just... works.
3. CLI tools and infrastructure code
Go compiles to a single static binary. No runtime, no node_modules, no version management hell. For my HookScope CLI (distributed via Homebrew), Go was the only sensible choice. Try distributing a NestJS CLI tool to developers who don't have Node installed.
4. When you want to reason about what's actually happening
This is the subtle one. Go's concurrency model is explicit. When you write go f(), you're saying "run this concurrently." When you communicate via channels, you can see data flow. In Node, the event loop abstraction is powerful but opaque — you often can't tell why something is slow without profiling.
Go's explicitness is sometimes more work. It's also clarity.
The Go Concurrency Model: What's Actually Under the Hood
Okay, here's where we go deep. I want to explain the Go scheduler in a way that gives you real intuition, not just vocabulary.
Goroutines Are Not OS Threads
This is the most important thing to understand.
An OS thread is expensive:
- Stack size: typically 1–8 MB by default
- Creation time: involves a syscall, kernel context switch
- Scheduling: managed by the OS scheduler, which knows nothing about your program
A goroutine is cheap:
- Initial stack size: 2 KB in current Go (grows dynamically as needed, up to 1 GB by default on 64-bit systems)
- Creation time: a few hundred nanoseconds
- Scheduling: managed by the Go runtime scheduler, which does know about your program
This is why you can run hundreds of thousands of goroutines without breaking a sweat, while running that many OS threads would immediately destroy your machine.
// This is completely fine in Go
package main

import (
    "fmt"
    "time"
)

func main() {
    for i := 0; i < 100_000; i++ {
        go func(id int) {
            time.Sleep(10 * time.Second)
            fmt.Printf("goroutine %d done\n", id)
        }(i)
    }
    time.Sleep(15 * time.Second) // crude wait, but fine for a demo
}
Trying to do this with OS threads in C or Java? Good luck. You'd hit OS limits around 10,000 threads on most systems, and memory usage would be catastrophic.
The M:N Threading Model
Go uses what's called an M:N threading model: M goroutines are multiplexed across N OS threads.
Goroutines (M):   G1   G2   G3   G4   G5   G6  ...
                   |    |    |     \   |   /
OS Threads (N):   T1   T2   T3        T4
                   |    |    |         |
CPU Cores:        C1   C2   C3        C4
The key insight: the number of threads actively running Go code at any instant is capped by GOMAXPROCS (which defaults to runtime.NumCPU(), the logical core count). The runtime may create extra threads for goroutines stuck in blocking syscalls, and it maps goroutines onto threads dynamically.
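You can check these numbers on your own machine:

package main

import (
    "fmt"
    "runtime"
)

func main() {
    fmt.Println("CPU cores:  ", runtime.NumCPU())
    fmt.Println("GOMAXPROCS: ", runtime.GOMAXPROCS(0)) // passing 0 reads the value without changing it
    fmt.Println("goroutines: ", runtime.NumGoroutine())
}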
The GMP Scheduler
Go's scheduler is built around three entities: G, M, and P.
G — Goroutine
A G represents a goroutine. It contains:
- The goroutine's stack
- Its current status (running, runnable, waiting, dead)
- The program counter (where it is in execution)
- A pointer to the function it's running
M — Machine (OS Thread)
An M represents an OS thread. It's the actual execution context provided by the operating system. Each M is associated with at most one P at a time.
When an M is stuck in a blocking syscall (file I/O, a cgo call), Go detaches it from its P and lets the P grab another M to keep executing goroutines. This is crucial — it's how Go avoids stalling the entire program on a single blocking call. (Network I/O mostly doesn't block the M at all; it goes through the runtime's netpoller, which parks just the goroutine.)
P — Processor
P is the most subtle concept. A P represents a "logical processor" — a context required to execute Go code. Each P has its own local run queue of goroutines waiting to be executed.
The number of Ps = GOMAXPROCS (defaults to CPU core count). This is your true parallelism ceiling: no matter how many goroutines you have, at most GOMAXPROCS goroutines run simultaneously.
P1 [local queue: G3, G7, G11] → executing G1 on M1
P2 [local queue: G4, G8]      → executing G2 on M2
P3 [local queue: G5, G9, G12] → executing G6 on M3
P4 [local queue: G10]         → executing G13 on M4
Global run queue: [G14, G15, G16, ...]
Work Stealing
Here's one of the smartest parts of the Go scheduler: work stealing.
If P1's local queue is empty and P2's queue has goroutines sitting idle, P1 will steal half of P2's goroutines. This happens automatically, without any programmer involvement.
Why does this matter? It means you don't need to manually balance work across goroutines. The scheduler does it. CPU time is efficiently utilized without coordination from your application code.
// You don't need to think about which CPU core handles which work;
// the scheduler does it (doWork stands in for real work).
results := make(chan int, numWorkers)
for i := 0; i < numWorkers; i++ {
    go func(id int) {
        results <- doWork(id) // Go figures out where this runs
    }(i)
}
for i := 0; i < numWorkers; i++ {
    fmt.Println(<-results) // collect every result
}
Goroutine Preemption — The Evolution
Early Go (pre-1.14) used cooperative scheduling: goroutines would yield only at specific points — function calls, channel operations, syscalls. This meant a goroutine doing a tight CPU-bound loop would starve other goroutines on the same thread:
// Pre-1.14: this could starve other goroutines on the same P
func infiniteLoop() {
    for {
        // No function call, no yield point.
        // This goroutine never gives up the thread.
    }
}
Go 1.14 introduced asynchronous preemption: the runtime sends a signal (SIGURG on Unix) to the OS thread, which interrupts the goroutine at (almost) any point. This is similar to how OS schedulers work, but implemented at the Go runtime level.
The result: no more goroutine starvation from CPU-bound loops. The scheduler is now truly preemptive.
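Here's a small experiment that demonstrates it; the behavior depends on your Go version, which is exactly the point:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // force a single P so starvation is possible

    go func() {
        for { // tight loop: no function calls, no yield points
        }
    }()

    time.Sleep(100 * time.Millisecond)
    fmt.Println("main still gets scheduled") // prints on Go 1.14+; pre-1.14 this could hang forever
}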
Channels: Communicating Instead of Sharing
Go's philosophy on concurrency comes from Tony Hoare's Communicating Sequential Processes (CSP), formalized in 1978.
The Go proverb that captures it:
"Do not communicate by sharing memory; instead, share memory by communicating."
This is a direct inversion of how most concurrent programming is taught. Traditional approach: you have shared state, you protect it with mutexes. Go's approach: you send data through channels.
What a Channel Actually Is
A channel is a typed conduit. Sending to a channel and receiving from a channel are synchronization points.
ch := make(chan int)     // unbuffered channel
buf := make(chan int, 5) // buffered channel, capacity 5
Unbuffered channel: sender blocks until receiver is ready. Receiver blocks until sender sends. This is a rendezvous — both parties must be present.
Buffered channel: sender blocks only when buffer is full. Receiver blocks only when buffer is empty. Decoupled, but bounded.
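A tiny runnable demo of the rendezvous:

package main

import "fmt"

func main() {
    ch := make(chan string) // unbuffered: send and receive must meet

    go func() {
        ch <- "hello" // blocks until main is ready to receive
    }()

    fmt.Println(<-ch) // both sides rendezvous here; prints "hello"
}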
Under the hood, a channel is a hchan struct in the Go runtime:
type hchan struct { // simplified; the real struct has a few more fields
    qcount   uint           // number of elements in queue
    dataqsiz uint           // capacity of circular queue
    buf      unsafe.Pointer // pointer to circular queue array
    elemsize uint16
    closed   uint32
    sendq    waitq // list of blocked senders
    recvq    waitq // list of blocked receivers
    lock     mutex
}
When you send to a full buffered channel:
- Your goroutine is added to sendq
- The goroutine is parked (removed from the run queue)
- When a receiver drains an element, it checks sendq, wakes the blocked sender, and copies the data directly from the sender's stack to the receiver's
This is the key insight: Go minimizes data copying and context switching by letting goroutines park/unpark efficiently at channel boundaries.
Patterns I Actually Use
Fan-out: distribute work across workers
func fanOut(jobs <-chan Job, numWorkers int) <-chan Result {
    results := make(chan Result, numWorkers)
    var wg sync.WaitGroup
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for job := range jobs {
                results <- process(job)
            }
        }()
    }
    go func() {
        wg.Wait()
        close(results) // safe to close: all senders have finished
    }()
    return results
}
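Hypothetical usage, assuming some pending slice of jobs: feed the jobs channel from one goroutine, close it when you're done, and range over the results:

jobs := make(chan Job)
go func() {
    defer close(jobs) // signals workers to exit their range loops
    for _, j := range pending {
        jobs <- j
    }
}()

for r := range fanOut(jobs, 8) {
    fmt.Println(r) // consume until fanOut closes results
}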
Pipeline: chain stages of processing
func pipeline(input <-chan RawEvent) <-chan ProcessedEvent {
    validated := validate(input)  // stage 1
    enriched := enrich(validated) // stage 2
    return transform(enriched)    // stage 3
}
// Each stage is a goroutine consuming from one channel
// and producing to another. Backpressure is built in.
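Every stage has the same shape. Here's a minimal sketch of what stage 1 might look like (the isValid check is a stand-in):

func validate(in <-chan RawEvent) <-chan RawEvent {
    out := make(chan RawEvent)
    go func() {
        defer close(out) // closing propagates shutdown downstream
        for e := range in {
            if isValid(e) { // hypothetical validation
                out <- e
            }
        }
    }()
    return out
}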
Cancellation with context
func processWithTimeout(ctx context.Context, id string) error {
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    // Buffer of 1: the worker can always deliver its result and exit,
    // even if we time out and never read the channel. No goroutine leak.
    resultCh := make(chan Result, 1)
    go func() {
        resultCh <- doExpensiveWork(id)
    }()

    select {
    case result := <-resultCh:
        return handleResult(result)
    case <-ctx.Done():
        return ctx.Err() // deadline exceeded or cancelled
    }
}
The select statement is Go's concurrency control flow. It blocks until one of the cases is ready. If multiple are ready, it picks one at random (genuinely random — this is specified in the language spec to prevent starvation).
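You can observe the randomness yourself when both cases are ready:

package main

import "fmt"

func main() {
    a := make(chan int, 1)
    b := make(chan int, 1)
    a <- 1
    b <- 2

    // Both cases are ready; the spec mandates a uniform pseudo-random pick.
    select {
    case v := <-a:
        fmt.Println("picked a:", v)
    case v := <-b:
        fmt.Println("picked b:", v)
    }
}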
The sync Package: When Channels Aren't the Answer
I want to push back on something: channels are not always the right tool.
Go's standard library has sync.Mutex, sync.RWMutex, sync.WaitGroup, sync.Once, and sync/atomic. These exist because sometimes shared state with a mutex is simpler and more appropriate than channels.
Rule of thumb I follow:
- Channel: when you're passing ownership of data, or coordinating between goroutines
- Mutex: when you're protecting shared state that multiple goroutines need to read/write
- atomic: when you need lock-free operations on simple values (counters, flags)
// This is fine — a mutex is clearer here than a channel
type Cache struct {
    mu    sync.RWMutex
    items map[string]Item
}

func NewCache() *Cache {
    return &Cache{items: make(map[string]Item)} // writing to a nil map panics
}

func (c *Cache) Get(key string) (Item, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    item, ok := c.items[key]
    return item, ok
}

func (c *Cache) Set(key string, item Item) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.items[key] = item
}
Don't force channels where they don't fit. The goal is clarity.
Common Pitfalls I've Hit (So You Don't Have To)
1. Goroutine Leaks
This is the #1 production issue with goroutines. A goroutine that's blocked forever on a channel no one will ever send to — it sits in memory, forever.
// BUG: if no one sends to ch, this goroutine leaks
go func() {
    val := <-ch // blocks forever if ch is abandoned
    process(val)
}()
Fix: always use context.Context for cancellation, and design your channels so they can be closed or drained. Use goleak in tests to catch leaks.
go func() {
    select {
    case val := <-ch:
        process(val)
    case <-ctx.Done():
        return // clean exit
    }
}()
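For the test-side safety net, goleak (go.uber.org/goleak) hooks into TestMain and fails the run if goroutines are still alive when the package's tests finish:

package mypkg

import (
    "testing"

    "go.uber.org/goleak"
)

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m) // fails the run if unexpected goroutines remain
}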
2. Closing a Channel from the Receiver
Only the sender should close a channel. If the receiver closes it and the sender tries to send, you get a panic.
// WRONG
func receiver(ch chan int) {
    val := <-ch
    close(ch) // dangerous if the sender hasn't stopped
    process(val)
}

// RIGHT: sender closes
func sender(ch chan int) {
    defer close(ch)
    for _, v := range data {
        ch <- v
    }
}
3. The Loop Variable Capture Bug
Classic Go footgun (fixed in Go 1.22, but worth knowing for older codebases):
// BUG in Go < 1.22: all goroutines capture the same `i`
for i := 0; i < 5; i++ {
    go func() {
        fmt.Println(i) // likely prints 5 five times
    }()
}

// FIX (pre-1.22): pass as parameter
for i := 0; i < 5; i++ {
    go func(id int) {
        fmt.Println(id) // correct
    }(i)
}
Go 1.22 changed loop variable semantics so each iteration gets its own variable. But if you're maintaining older code, know this bug exists.
4. Unbounded Goroutine Spawning
// BUG: spawns a goroutine per request with no limit
http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
    go doHeavyWork() // what if 10,000 requests hit simultaneously?
})
Fix: use a worker pool with a bounded channel:
workerPool := make(chan struct{}, 100) // max 100 concurrent
http.HandleFunc("/work", func(w http.ResponseWriter, r *http.Request) {
    workerPool <- struct{}{} // acquire slot (blocks if full)
    go func() {
        defer func() { <-workerPool }() // release slot
        doHeavyWork()
    }()
})
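If you're already pulling in golang.org/x/sync, errgroup expresses the same bound more declaratively. A sketch, with totalJobs standing in for however you count the work:

g := new(errgroup.Group)
g.SetLimit(100) // g.Go blocks while 100 goroutines are in flight

for i := 0; i < totalJobs; i++ {
    g.Go(func() error {
        doHeavyWork()
        return nil
    })
}
_ = g.Wait() // waits for all spawned goroutines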
5. Race Conditions — Use the Race Detector
Go ships with a built-in race detector. Use it:
go test -race ./...
go run -race main.go
It adds overhead (~2–20x slowdown), but catches races that would otherwise appear as subtle bugs in production. I run it in CI on every push. Non-negotiable.
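For a feel of what it catches, here's the classic unsynchronized counter; go run -race main.go flags the increment immediately:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var counter int // shared, unsynchronized
    var wg sync.WaitGroup

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            counter++ // read-modify-write race; the detector flags this line
        }()
    }

    wg.Wait()
    fmt.Println(counter) // often less than 1000
}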
Practical Architecture: The Outbox Pattern in Go
Let me show you something concrete. In Billy, I have an outbound webhook system — when things happen, I need to notify customers' endpoints reliably. Classic Outbox Pattern.
Here's how I'd implement the dispatcher in Go:
type OutboxDispatcher struct {
    db      *sql.DB
    client  *http.Client
    workers int
}

func (d *OutboxDispatcher) Run(ctx context.Context) error {
    jobs := make(chan OutboxEvent, d.workers*2)
    results := make(chan dispatchResult, d.workers*2)

    // Start worker pool
    var wg sync.WaitGroup
    for i := 0; i < d.workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for event := range jobs {
                select {
                case results <- d.dispatch(ctx, event):
                case <-ctx.Done():
                    return
                }
            }
        }()
    }

    // Result processor
    go func() {
        for result := range results {
            if result.err != nil {
                d.markFailed(result.event, result.err)
                continue
            }
            d.markDelivered(result.event)
        }
    }()

    // Polling loop
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            events, err := d.fetchPending(ctx)
            if err != nil {
                log.Printf("fetch error: %v", err)
                continue
            }
            for _, event := range events {
                select {
                case jobs <- event:
                case <-ctx.Done():
                    close(jobs)
                    wg.Wait()
                    close(results)
                    return ctx.Err()
                }
            }
        case <-ctx.Done():
            close(jobs)
            wg.Wait()
            close(results)
            return ctx.Err()
        }
    }
}
What this gives you: bounded concurrency, clean shutdown via context cancellation, backpressure via buffered channels, and zero external dependencies. No Redis, no BullMQ, no queue broker.
Is this appropriate for every project? No. At scale you'd want persistent queues, dead-letter queues, observability. But for a mid-sized SaaS bootstrapped by one developer? The simplicity-to-reliability ratio is hard to beat.
Benchmarks: Why This Model Is Fast
Let me give you some numbers to anchor the intuition.
Context switch cost:
- OS thread context switch: ~1–10 microseconds
- Goroutine context switch: ~100–300 nanoseconds (roughly 10–50x faster)
Memory per unit:
- OS thread: 1–8 MB stack (fixed at creation)
- Goroutine: 2–8 KB initial stack (grows as needed)
Maximum concurrent units (typical server):
- OS threads: ~10,000 before memory pressure becomes critical
- Goroutines: 100,000–1,000,000 routinely
This is why Go HTTP servers can handle massive connection counts without a thread-per-connection model. Each connection gets a goroutine; the runtime handles the rest.
A Go HTTP server handles each connection in a goroutine:
// This is essentially what net/http does internally
func (srv *Server) Serve(l net.Listener) error {
    for {
        conn, err := l.Accept()
        if err != nil {
            return err
        }
        go srv.handleConn(conn) // one goroutine per connection
    }
}
With OS threads, this model breaks at ~10K connections. With goroutines, it's standard practice at 100K+.
Where Go Falls Short (Being Honest)
No language post should end without the limitations.
Generics are still maturing. Go 1.18 added generics, but the ecosystem adoption is uneven. Type inference has gaps. Error messages with generic types can be cryptic.
Error handling is verbose. The if err != nil pattern gets tiresome. There's no ? operator like Rust, no .unwrap_or. You write a lot of error-checking boilerplate. Go proposals for better error handling have been discussed for years without resolution.
No inheritance, limited OOP patterns. Go has interfaces and embedding, not inheritance. Coming from TypeScript/Java, this requires a genuine mental shift. Sometimes it's liberating; sometimes you want to express something naturally hierarchical and Go makes you work for it.
The ecosystem is younger. npm has millions of packages. Go's ecosystem is smaller. For niche integrations or specific tooling, you might be writing it yourself.
Debugging goroutine dumps is rough. When your program crashes with 10,000 goroutines, the stack trace is biblical. Tools exist (pprof, trace), but the learning curve is real.
Summary: The Mental Model That Actually Helps
Here's what I want you to take away:
Goroutines are not threads. They're lightweight, managed by the Go runtime, and can number in the hundreds of thousands.
The GMP scheduler is the engine. G (goroutines), M (OS threads), P (logical processors). P count = true parallelism. Work stealing keeps CPUs fed without programmer effort.
Channels are about coordination, not just communication. The blocking semantics of unbuffered channels are a synchronization primitive. Use them to express when goroutines should rendezvous.
Channels are not always the answer. sync.Mutex exists for a reason. Choose the tool that makes the code clearest.
Goroutine leaks are the #1 production bug. Always think about how goroutines exit. Always use context cancellation. Always run with -race.
The simplicity is intentional. Go has very few concurrency primitives — goroutines, channels, select, sync. This is not a limitation. It's a design philosophy: small surface area, deep composability.
For the kind of backend work I do — billing systems, webhook processors, CLI tools, API servers — Go's concurrency model is genuinely better suited than anything I've used in Node.js. Not because Node is bad, but because Go's model is explicit, performant, and scales from a single goroutine to a distributed system with the same mental model.
If you're building backend systems and you haven't spent serious time with Go's concurrency, I think it's worth your time. Not to replace your current stack, but to expand how you think about structure, scheduling, and what's possible.
Building Billy (B2B billing SaaS), HookScope (webhook inspector), and a few other things publicly on X (@kubabuilds). Posts about architecture, Go, and the reality of solo SaaS development.