It's 3am. The Kafka consumer pod that's been running cleanly for six weeks gets OOM-killed. Kubernetes restarts it. Five minutes later: OOM-killed again. Restart. OOM-killed a third time. By the fourth restart I've shelved the dashboard and started reading runtime/chan.go.
The code that died fit on one line:
events := make(chan Event)
I want to tell you that line is the bug. It isn't. An unbuffered channel will happily backpressure a single producer — every send rendezvouses with a receiver, so the producer cannot run ahead. The channel did exactly what it was designed to do.
What I had built around it didn't. The Kafka consumer loop wrapped events <- parseEvent(msg) inside a go func(msg) { ... }(msg), spawning a fresh goroutine per inbound message. Every one of those goroutines blocked on send, parked on the channel's sendq list, and kept its stack and the parsed event alive in memory. The channel was the gravestone. The unbounded go func fan-out was what filled it.
This is the story of what a Go channel actually is at the runtime level, why "channels are message passing" is one of the most expensive lies in the Go ecosystem, and why the most common channel bug isn't in the channel — it's in the code that calls into it.
tl;dr — A Go channel is not a queue and not a message bus. It's a heap-allocated hchan struct containing a mutex, a ring buffer, and two parked-goroutine lists. The send operation is a memcpy under a lock, not a transmission. Channels only deliver backpressure if the producer side is bounded. The OOM that started this story came not from make(chan Event) — that was working as designed — but from an unbounded go func(msg) fan-out parking thousands of goroutines on sendq, each retaining a 10KB payload. The fix isn't a buffer size. It's making backpressure part of the producer contract: a single long-lived producer with select-based backoff, plus a bounded queue as a safety net. The same architectural mistake shows up at every layer where engineers reach for an "in-process queue" — including the inbound queue of your AI agent.
The Mental Model That Killed The Pod
Here is what I thought a channel did, and I suspect most Go engineers carry some version of this picture:
"A channel is like a Kafka topic in-process. Producers push messages onto it. Consumers pull messages off it. The runtime handles ordering and delivery. It's CSP — Communicating Sequential Processes — Hoare's thing, basically a typed pipe."
Every word of that sentence is wrong in a way that matters. There is no topic. Nothing is being pushed anywhere. The runtime is not a broker. The word passing — borrowed from message-passing concurrency, where independent processes communicate across an isolation boundary — is the most misleading part. In a Go channel, there is no isolation boundary. There is one struct on the heap, and both goroutines reach in and mutate it.
I held the message-passing model long enough that when the Kafka consumer started ingesting a 12-hour upstream replay at full throttle, I had no instinct that the messages were going somewhere bounded. They weren't. They were piling up in goroutines parked against a channel that had no buffer for them to land in.
What A Channel Actually Is
Crack open runtime/chan.go in the Go source tree and you'll find this (layout stable since Go 1.7, confirmed against Go 1.21–1.25):
type hchan struct {
qcount uint // total data in the queue
dataqsiz uint // size of the circular queue
buf unsafe.Pointer // points to an array of dataqsiz elements
elemsize uint16
closed uint32
elemtype *_type
sendx uint // send index
recvx uint // receive index
recvq waitq // list of recv waiters
sendq waitq // list of send waiters
lock mutex
}
That's it. That's the channel. A struct with a mutex, a pointer to a circular byte array, two indices to track read/write positions in the ring, and two intrusive linked lists holding parked goroutines that are waiting to send or receive.
When you write ch <- value, the runtime calls chansend, which does roughly this:
- Take the lock (lock(&c.lock)).
- Check recvq — is there a goroutine already parked waiting to receive? If yes, copy value directly from the sender's stack into the receiver's stack via sendDirect, mark the receiver runnable with goready, release the lock, return. No buffer involved — when a receiver is already waiting, send can hand off directly without ever touching the ring buffer. (In normal operation a buffered channel can't simultaneously have queued data AND parked receivers; if recvq has a waiter, the buffer is empty.)
- Otherwise, check buffer space — if qcount < dataqsiz, copy value into buf[sendx], advance sendx, increment qcount, release the lock, return.
- Otherwise, park the sender — append the sender's goroutine to sendq, release the lock, and call gopark to suspend execution until a receiver wakes it up.
Receive is the mirror image, calling chanrecv with sendq and recvq swapped.
Here is the shape of it:
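In runnable form — a short, self-contained program that exercises all three paths from the outside (the sleeps are only there to give goroutines time to park; they're illustrative, not part of the mechanism):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Path 1: a receiver is already parked on recvq — the send copies the
	// value straight into the receiver's stack (sendDirect) and wakes it.
	direct := make(chan string)
	go func() { fmt.Println("received:", <-direct) }()
	time.Sleep(50 * time.Millisecond) // let the receiver park first
	direct <- "handed off without touching any buffer"

	// Path 2: buffer has space — the value is copied into buf, nobody parks.
	buffered := make(chan int, 1)
	buffered <- 1
	fmt.Println("buffered send returned with no receiver in sight")

	// Path 3: unbuffered channel, no receiver — the sender parks on sendq.
	unbuffered := make(chan int)
	before := runtime.NumGoroutine()
	go func() { unbuffered <- 42 }() // blocks: goes onto sendq via gopark
	time.Sleep(50 * time.Millisecond)
	fmt.Printf("goroutines: %d -> %d (the extra one is parked on sendq)\n",
		before, runtime.NumGoroutine())
	fmt.Println("drained:", <-unbuffered) // wake it so the program exits cleanly
}
```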
Three things are worth burning into memory:
One — there is no transport. The "message" never leaves the heap. Sender writes bytes; receiver reads bytes; the lock arbitrates. This is shared-memory synchronization with the appearance of message passing.
Two — the buffer is just a ring of typed slots. dataqsiz is set exactly once, at make(chan T, N) time. If you write make(chan T), dataqsiz is zero and there is no buffer at all — every send must rendezvous with a receiver or park.
Three — sendq is unbounded. This is the part nobody talks about. The ring buffer has a fixed size. The list of parked senders waiting to write into the ring buffer does not. If a thousand goroutines all hit a full channel, the runtime parks all thousand of them on sendq and each one keeps its stack and any data it was about to send alive in memory.
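Point three reproduces without Kafka or a cluster. A minimal sketch — the payload size and goroutine count are arbitrary, chosen to stay laptop-safe, not taken from the incident:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	ch := make(chan []byte) // unbuffered, and nobody will ever receive

	for i := 0; i < 10_000; i++ {
		go func() {
			payload := make([]byte, 10<<10) // ~10KB, stands in for a parsed Event
			ch <- payload                   // parks forever on sendq, pinning payload + stack
		}()
	}

	time.Sleep(500 * time.Millisecond) // let everyone reach gopark
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines=%d heap_alloc=%d MB\n",
		runtime.NumGoroutine(), m.HeapAlloc>>20)
	// Expect roughly 10,001 goroutines and on the order of 100 MB of heap —
	// none of it collectible, because every parked sender is still reachable.
}
```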
That third point is what gave the OOM I hit its particular shape: the memory wasn't sitting in an oversized buffer, it was pinned by a crowd of parked senders.
The Incident, Mechanism By Mechanism
The pod that died had a goroutine topology that looked like this — and the bug is not the make(chan Event) line. Watch the outer loop:
events := make(chan Event)
// Consumer — slow.
go func() {
for ev := range events {
process(ev) // ~3ms per event
}
}()
// THE ACTUAL BUG: outer loop spawns a fresh goroutine per inbound message.
for msg := range kafkaConsumer.Messages() {
go func(msg kafka.Message) {
events <- parseEvent(msg) // every blocked send parks on sendq
}(msg)
}
If you replace the inner go func(msg) { ... }(msg) with a direct events <- parseEvent(msg), the outer loop itself becomes the producer, and the unbuffered channel correctly backpressures it — the loop simply doesn't advance until the consumer is ready. No OOM.
But because each message is dispatched to a fresh helper goroutine, the outer loop never blocks. It keeps spawning. Each helper goroutine reaches the send, finds no waiting receiver, and parks on sendq. Now sendq is the unbounded thing. Here is what actually happened, in order:
1. Sustained baseline: rendezvous works
At 1K msg/sec inbound and ~3ms per process call (~333/sec consumer throughput), the consumer is already behind by 3x at steady state. For weeks this didn't OOM because the Kafka client's own internal buffer absorbed the gap, and lag built up on the broker side — visible in Grafana, ignored by me.
2. Replay: the producer detaches from the consumer's pace
When upstream re-emitted 12 hours of events, the Kafka client's internal pre-fetch buffer filled to capacity (default fetch.message.max.bytes × partition count = several hundred MB) and started backing up Kafka-side without applying backpressure to the consumer goroutine, because the client library was configured with a large internal queue.
3. The actual heap growth: parked sender goroutines
Each call to events <- parseEvent(msg) on the unbuffered channel would either rendezvous (rare during replay) or park. When it parked, the sender goroutine held:
- Its own stack (~8KB minimum, grew under load)
- The Event value it was about to send (~10KB per event with strings, headers, payload)
- A reference into the Kafka message it was parsing (another ~10KB)
Multiply by the number of in-flight parsing goroutines — which kept being spawned by an outer loop that didn't apply backpressure to itself — and you arrive at the 12GB heap. (At roughly 28KB retained per parked goroutine — stack, Event, Kafka message — every 100K parked goroutines is close to another 3GB.) The channel's sendq was the proximate memory sink, not the buffer (which was zero-sized).
The lifecycle of each parsing goroutine was short and one-way: spawned by the outer loop, parse the message, block on the unbuffered send, park on sendq — and stay there. Every goroutine parked on sendq is reachable (it sits on the channel's wait queue, which is rooted in the hchan struct, which is in turn referenced by both the producer and consumer goroutines). Reachable means non-collectible. The longer the consumer falls behind, the longer the queue grows.
4. GC can't help
Go's GC can only reclaim unreachable memory. Every parked goroutine on sendq is reachable (it's on the channel's wait queue), and every Event it's holding is reachable through it. The GC ran, found nothing to free, and the heap continued growing until the kernel OOM-killer fired.
5. The cgroup hammer drops
cgroup memory limit was 4GB. Heap crossed 4GB. OOM kill. Kubernetes restarted the pod. The replay was still in progress on the broker side, so the same sequence ran again. And again.
What this looks like in pprof
You don't have to take my word for the mechanism — it reproduces in under a minute. I built a minimal demo at harrison001/channels-oom-demo (cmd/bug) that runs the same workload shape on a laptop. The output of the bug version over 22 seconds, captured with runtime.NumGoroutine() and runtime.MemStats.HeapAlloc:
t= 1s goroutines= 497 heap_alloc= 5 MB
t= 5s goroutines= 2462 heap_alloc= 28 MB
t= 10s goroutines= 4915 heap_alloc= 61 MB
t= 15s goroutines= 7369 heap_alloc= 89 MB
t= 20s goroutines= 9828 heap_alloc= 109 MB
t= 22s goroutines= 10813 heap_alloc= 125 MB
Goroutine count grows at a near-constant rate — roughly 500 net new goroutines per second in this run. Heap grows at ~5MB/sec, dominated by the 10KB Event payload each parked goroutine is holding. Extrapolate to a 12-hour replay at production volume and you arrive at the original 12GB OOM.
For comparison, the fix version (cmd/fix) on the same workload:
t= 1s goroutines= 3 heap_alloc= 3 MB chan_len= 256
t= 10s goroutines= 3 heap_alloc= 4 MB chan_len= 256
t= 20s goroutines= 3 heap_alloc= 5 MB chan_len= 256
Three goroutines (producer, consumer, pprof listener). Heap flat at 4-5 MB. Channel pinned at its 256-slot bound, meaning the producer is constantly blocked on send and applying backpressure upstream — exactly what we want.
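The goroutine and heap columns in those traces come straight from the standard library (chan_len is just len() on the channel). A minimal once-per-second sampler — not necessarily what the demo repo uses, but equivalent:

```go
// Minimal sampler for the two numbers quoted above; standard library only.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	start := time.Now()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		fmt.Printf("t=%2ds goroutines=%6d heap_alloc=%4d MB\n",
			int(time.Since(start).Seconds()),
			runtime.NumGoroutine(), m.HeapAlloc>>20)
	}
}
```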
The Fix, And Why It Works
The visible code change was one parameter. The real fix was making backpressure part of the producer contract — two changes, working together:
events := make(chan Event, 256) // (1) bounded queue as safety net
// (2) single long-lived producer goroutine with select-based backoff —
// NO outer loop spawning fresh goroutines per message.
go func() {
for msg := range kafkaConsumer.Messages() {
select {
case events <- parseEvent(msg):
// sent — loop continues at consumer speed when channel fills
case <-ctx.Done():
return
}
}
}()
The key word in change (2) is single. There is exactly one goroutine reading from Kafka and writing to the channel. When the channel fills, that goroutine blocks on send; the for msg := range loop stops calling Poll(); the Kafka client's internal pre-fetch queue stops draining; consumer lag accumulates broker-side; the broker simply retains messages until we come back. No go func(msg) helpers. Nothing piling up on sendq. Memory stays bounded because the producer is bounded — the buffer is only a safety net to absorb short bursts.
What this changes, mechanically:
| Before (unbounded go func fan-out + make(chan Event)) | After (single producer + make(chan Event, 256)) |
|---|---|
| One goroutine per inbound message | One long-lived producer goroutine |
| sendq grows unboundedly with parked helpers | sendq empty by construction; producer is sole sender |
| No signal to upstream — outer loop never blocks | Producer blocks on send; outer loop runs at consumer speed |
| Kafka client keeps pre-fetching, lag invisible | Kafka client's internal queue fills, consumer stops polling, broker-side lag accumulates |
| OOM | Bounded heap, bounded latency, Kafka rebalances cleanly when behind |
A bounded channel buffer alone does not prevent this OOM. If you applied change (1) without change (2), you'd merely increase the OOM-killing rate — the outer go func(msg) fan-out would keep spawning, the buffer would fill in milliseconds, helpers would pile up on sendq exactly as before. Backpressure is not a property of any one component — it is a property of the entire chain having no unbounded buffer (and no unbounded fan-out) anywhere in it.
With the fix in place, every link in the chain is bounded — the database behind process() has connection pool limits, the consumer is rate-limited by process() latency, the channel buffer is 256, the Kafka client's internal queue has a configured max, and the broker simply retains messages on disk when its consumer falls behind. When ANY downstream link slows, the pressure propagates back up by the consumer ceasing to pull; the broker doesn't need to be told anything. The whole system runs at the rate of its slowest component.
If any link in that chain has an unbounded buffer, the chain has no backpressure. That link will absorb the load until it OOMs.
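For the Kafka-client link specifically — assuming a Sarama-style client, which the Messages() channel in the snippets suggests — the "configured max" looks something like this. Field names are Sarama's; the values are illustrative, not the incident's actual config:

```go
package consumer

import "github.com/IBM/sarama"

// newConsumerConfig bounds the client-side buffering so a replay backs up on
// the broker instead of in process memory. Values are examples, not tuned
// recommendations; the effective limit is the product of these knobs across
// all assigned partitions.
func newConsumerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.ChannelBufferSize = 256          // messages buffered on each internal channel
	cfg.Consumer.Fetch.Default = 1 << 20 // 1 MiB requested per fetch
	cfg.Consumer.Fetch.Max = 4 << 20     // hard cap on bytes per fetch response
	return cfg
}
```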
Bounded Buffers Are Not About Channels
The lesson is not "use buffered channels." The lesson is:
Any in-process queue without a capacity bound is a latent OOM.
This applies identically across runtimes:
| Runtime | The footgun | The fix |
|---|---|---|
| Go | Unbounded goroutine fan-out parked on sends (go func(msg) { ch <- ... }(msg)); oversized buffered channels | Single long-lived producer + select + bounded buffer as safety net |
| Rust (Tokio) | mpsc::unbounded_channel() | mpsc::channel(N) |
| Python (asyncio) | asyncio.Queue() with no maxsize | asyncio.Queue(maxsize=N) |
| Node.js | Unbounded array of in-flight Promises | p-limit, Sema, or explicit pool |
| Erlang/Elixir | Process mailbox grows unboundedly when selective receive can't keep up | Demand-driven flow control: GenStage / Flow for pipelines, or explicit ack-based protocols in gen_statem |
Every one of these reaches for the same shape — an in-process queue — and every one of them OOMs the same way when the shape is unbounded.
When Channels Are The Right Tool
I want to be careful not to overcorrect. Channels are not a mistake. They are an excellent primitive used incorrectly. Cases where reaching for a channel is the right call:
- Cancellation signaling — context.Done() is a <-chan struct{}. This is canonical.
- Fan-out work distribution with a worker pool — a bounded channel feeding N worker goroutines is a clean semaphore. Buffer size = pool size or a small multiple of it (see the sketch after this list).
- Producer-consumer with a known throughput ratio — yes, with a bounded buffer sized to the latency budget.
- Error aggregation from concurrent goroutines — small buffered channel, drained on goroutine completion.
- Handoff between pipeline stages — bounded, with explicit close semantics on the upstream stage.
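The worker-pool case from that list, as a minimal runnable sketch — jobs and handle are illustrative names, and the buffer is sized to the pool as described above:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func handle(j int) {
	time.Sleep(10 * time.Millisecond) // stand-in for real work
	fmt.Println("handled", j)
}

func main() {
	const workers = 4
	jobs := make(chan int, workers) // buffer ≈ pool size: a semaphore, not a warehouse

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				handle(j)
			}
		}()
	}

	// Single producer: blocks whenever all workers are busy and the buffer is full.
	for i := 0; i < 20; i++ {
		jobs <- i
	}
	close(jobs) // lets the range loops finish once the queue drains
	wg.Wait()
}
```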
Cases where reaching for a channel is the wrong call:
- Cross-process messaging — use a real broker (NATS, Kafka, Redis Streams). Channels do not survive a pod restart.
- Persistence — channels live in process memory. If your pod dies, the in-flight data is gone. If you need "at least once" across restarts, you need a real queue.
- Bursty load with unknown shape — if you cannot put a meaningful upper bound on the buffer, you have not understood the load. Adding a channel does not give you understanding; it postpones the OOM.
- Anything that wants to be a message bus — that's not a channel. That's a message bus. They are different categories of system.
The Same Bug, Different Layer: AI Agent Inbound Queues
The reason this post lives in the SecurityLab track and not just "Go tips" is that the exact same mistake is now happening, at scale, in LLM agent infrastructure. I've seen the pattern repeatedly in recent AI backends — same architectural shape, different runtime.
The pattern: an agent backend exposes an HTTP endpoint. Each inbound request is dispatched to a worker pool via an in-process queue.
# The bug, in a different language
import asyncio

request_queue = asyncio.Queue()  # unbounded

async def http_handler(req):
    await request_queue.put(req)  # never blocks
    return {"status": "queued"}

async def worker():
    while True:
        req = await request_queue.get()
        await llm_call(req)  # 8 seconds, sometimes 30
Steady state looks fine: requests arrive a little faster than they're processed, the queue grows slowly, latency creeps up, and nobody notices because the HTTP layer keeps returning 200.
Then a launch happens. Or a viral tweet. Or a marketing email goes out. Inbound rate spikes 50x for 20 minutes. The queue accepts everything (it's unbounded). The worker pool can't keep up — LLM calls are inelastic, you can't parallelize past your token-per-minute quota. The queue grows to 200K items. Each item holds a request payload (~50KB with conversation history) and a future. 10GB of heap. OOM. Pod restart. All 200K requests lost. Users see 500s instead of the explicit "rate-limited, try again in 30s" they would have seen with proper backpressure.
The fix is identical to the Go fix:
request_queue = asyncio.Queue(maxsize=100)

async def http_handler(req):
    try:
        request_queue.put_nowait(req)
    except asyncio.QueueFull:
        return Response(status=503, headers={"Retry-After": "30"})
    return {"status": "queued"}
503 is a feature. It is the system telling the client we're at capacity, retry in 30 seconds. It is honest. It is bounded. It is the difference between a system that degrades gracefully and one that dies silently.
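And the same contract back in Go, since that's where this post lives: a non-blocking send via select/default turns a full bounded channel into an explicit 503. The job type, route, buffer size, and worker delay are all illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

type job struct{ prompt string } // illustrative payload type

var queue = make(chan job, 100) // bounded: this is the backpressure boundary

func enqueue(w http.ResponseWriter, r *http.Request) {
	select {
	case queue <- job{prompt: r.URL.Query().Get("q")}:
		w.WriteHeader(http.StatusAccepted) // queued
	default: // queue full — refuse loudly instead of buffering silently
		w.Header().Set("Retry-After", "30")
		http.Error(w, "at capacity, retry later", http.StatusServiceUnavailable)
	}
}

func main() {
	go func() { // single slow worker, standing in for the LLM call
		for j := range queue {
			time.Sleep(2 * time.Second)
			fmt.Println("processed:", j.prompt)
		}
	}()
	http.HandleFunc("/enqueue", enqueue)
	http.ListenAndServe(":8080", nil)
}
```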
Reproducing This Yourself
The numbers in this post come from a minimal Go program that fits in under 100 lines per command. The repo lives at:
git clone https://github.com/harrison001/channels-oom-demo.git
cd channels-oom-demo
# Watch goroutine count + heap climb every second
go run ./cmd/bug
# Switch to the fix — flat at 3 goroutines, 5 MB heap
go run ./cmd/fix
Each program exposes pprof on localhost:6060. While the bug version is running:
# Confirm 10K+ goroutines parked in runtime.chansend (the blocked senders)
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -20
# Confirm the heap is dominated by Event payloads, not the channel itself
go tool pprof -text http://localhost:6060/debug/pprof/heap
The bug demo has a hard cap at 20,000 goroutines so it won't actually OOM your laptop. Remove the cap if you want to see the kernel finish the job.
What I Wish I'd Known
If I could send one note back to myself eighteen months before the OOM:
When you reach for an in-process queue, you are choosing a backpressure boundary. The buffer size is not a performance tuning knob. It is a contract: under sustained load greater than my consumer's throughput, this is how much memory I am willing to lose before I tell the producer to stop. If you don't pick a number, the runtime picks one for you, and the number is whatever fits in RAM right before the kernel kills the process.
Channels in Go look like message-passing because the syntax was deliberately borrowed from CSP, a model where independent processes communicate by passing values across an isolation boundary. In Go there is no isolation boundary. The channel is a struct in shared memory, the goroutines are coroutines on the same scheduler, and the entire setup is synchronization plumbing in CSP clothing.
Once you see the hchan struct, you can't un-see it. Every channel decision after that is a synchronization decision, not a transport decision. And synchronization decisions always have a capacity bound — you just have to choose whether to pick it explicitly or have the OOM-killer pick it for you.
Keep going
- Code: harrison001/channels-oom-demo — reproduce both versions, capture your own pprof
- Next piece: Goroutines Are Cheap — Until Backpressure Is Missing — coming next. The producer side of the same mistake: why "just spawn a goroutine" is the second half of the bug.
- Subscribe: I write one of these monthly on runtime mechanics, distributed systems postmortems, and the security implications of getting them wrong. Newsletter · SecurityLab track
If you've hit this bug — or its cousin in a different runtime — I'd genuinely like to hear about it. The Erlang and Node.js shapes especially: I have hunches but not enough scars. Reply to the newsletter or open an issue on the demo repo.