As Go engineers, we've all internalized the context pattern. When an application shuts down, we propagate the cancellation, our select blocks catch <-ctx.Done(), and we return. Clean. Simple. Idiomatic.
It usually looks like this:
for {
    select {
    case <-ctx.Done():
        return // Clean exit, right?
    default:
        // Do work
    }
}
For many applications, this is "good enough." If you're running a stateless process that can safely restart mid-operation, Kubernetes will just spin up a new Pod and life goes on. A few "context canceled" errors in your logs? Who cares.
But what about processing a payment when SIGTERM arrives? Or acknowledging a message you've half-processed?
When "Good Enough" Isn't Enough Anymore
Not every application can afford to drop everything and exit. Consider what happens when you're halfway through processing a payment and SIGTERM arrives. Or when you've pulled a message from a queue but haven't acknowledged it yet.
Here are a few scenarios where naive context cancellation creates real problems:
Message queues with acknowledgments — SQS, RabbitMQ, Kafka — any system where you must explicitly confirm message processing. Unacknowledged messages get reprocessed, causing duplicates or delays.
Multi-step transactions — Payment captured but not recorded. Inventory reserved but not confirmed. Customer charged but order not created.
Non-idempotent operations — Notifications already sent. Resources already created. Webhooks already fired. You can't just "try again."
Long-running batch jobs — Hours of computation that can't resume from the middle. Data migrations left in inconsistent states.
In all these cases, the cost of interruption exceeds the cost of waiting a few extra seconds for clean completion.
If your code lives in any of these worlds, keep reading.
The Problem: One Signal, Two Jobs
Let's look at a typical message processing pipeline - I'll use SQS as an example since it clearly demonstrates the issue, but the pattern applies to any acknowledgment-based system.
Here's the typical implementation:
func ConsumeMessages(ctx context.Context, queueURL string) error {
    messages := make(chan Message, 100) // Buffered channel
    go pollMessages(ctx, queueURL, messages)

    for {
        select {
        case <-ctx.Done():
            // Problem: returns immediately!
            // Buffered messages abandoned
            // In-flight processing interrupted
            return ctx.Err()
        case msg := <-messages:
            processMessage(ctx, msg) // ← SIGTERM during this
            deleteFromSQS(ctx, msg)  // ← Never executes
        }
    }
}
When SIGTERM arrives and ctx.Done() fires, here's what may happen:
func processMessage(ctx context.Context, msg Message) error {
    // Step 1: Call external API
    if err := callSomeAPI(ctx, msg.Data); err != nil {
        return err // ctx.Done() could interrupt here
    }

    // Step 2: Database save interrupted by ctx.Done()
    if err := saveToDatabase(ctx, msg.Data); err != nil {
        return err // or here
    }

    return nil
}
Since we propagate the context into processMessage(), cancellation flows through every call it makes. The external API call might complete, but the database save fails with context.Canceled. Now you have partial state — the API side effect happened, but nothing was saved. And since deleteFromSQS() never runs, the message gets reprocessed, potentially duplicating the API call.
Additionally, any messages already fetched from the queue but sitting in a buffer waiting to be processed get abandoned when the loop exits — they'll timeout and reappear for reprocessing.
The root issue: we have one signal (ctx.Done()) trying to do two completely different jobs:
- Stop accepting new work
- Stop working immediately
We need to separate these concerns.
Mental Model: The Restaurant Kitchen
Imagine your system like a restaurant kitchen during closing time:
Orders In → Prep Station → Cooking → Serve to Customer
 (fetch)      (buffer)    (process)     (complete)
Naive shutdown (ctx.Done()):
The manager shouts "We're closed!" and the chef drops the pan mid-stir and walks out.
- Orders on the counter? Tossed.
- Food on the stove? Abandoned half-cooked.
- Customers waiting for meals? Kicked out hungry.
Good shutdown (graceful):
- Lock the front door — Stop taking new orders
- Finish what's cooking — Complete dishes on the stove
- Serve completed meals — Deliver to waiting customers
- Then turn off the lights and go home
Of course, you can't wait forever. If it's been 30 minutes and someone's still cooking, you'll eventually have to call it. But you give the kitchen a reasonable chance to finish.
This is exactly our strategy: control the shutdown signal so work can flow to completion.
The Two-Phase Shutdown
The solution is elegantly simple: use two signals instead of one.
Signal 1: "Please stop gracefully" — Stop accepting new work
Signal 2: "Emergency brake" — Force shutdown if graceful fails (via ctx.Done())
Here's the conceptual flow:
// The core idea
select {
case <-ctx.Done():
    // Emergency: parent context canceled, stop everything now
    return ctx.Err()
case <-stopSignalCh:
    // Graceful shutdown in three phases:

    // Phase 1: Stop accepting new work
    cancelPoller()

    // Phase 2: Wait for poller to finish and close the work channel
    <-pollerDone

    // Phase 3: Processor drains remaining messages from channel
    <-processorDone

    return nil
}
We're using a simple channel close as the coordination mechanism. When the poller stops, it closes the message channel. The processor, ranging over that channel, naturally drains any remaining messages and then exits. No complex synchronization needed — just Go's built-in channel semantics.
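If that sounds abstract, here is a minimal, self-contained sketch (not taken from the consumer code) showing the mechanism in isolation: the producer closes the channel when its context is canceled, and the consumer's range loop keeps going until every buffered item has been handled.

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    work := make(chan int, 10)

    // Producer: stops on cancellation and closes the channel.
    go func() {
        defer close(work) // the only coordination we need
        for i := 0; ; i++ {
            select {
            case <-ctx.Done():
                return
            case work <- i:
            }
        }
    }()

    // Consumer: ranges until the channel is closed AND drained.
    done := make(chan struct{})
    go func() {
        defer close(done)
        for item := range work {
            time.Sleep(10 * time.Millisecond) // simulate processing
            fmt.Println("processed", item)
        }
    }()

    time.Sleep(50 * time.Millisecond)
    cancel() // "lock the front door"
    <-done   // buffered items are still processed before we get here
    fmt.Println("all in-flight work completed")
}

Run it and you'll see the buffered items finish processing after cancel() fires; only then does the program exit.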
Implementation: A Real Example
Let me show you how this looks in practice. I'll use a real SQS consumer implementation that I built - you can see the full code on GitHub, but I'll walk through the key parts here.
The Data Flow
Before diving into code, let's visualize what we're building:
Single Message Flow:
┌──────┐    ┌───────────┐    ┌─────────┐    ┌────────┐
│ Poll │───▶│ Transform │───▶│ Process │───▶│ Delete │
│ SQS  │    │ Message   │    │ Handler │    │ (ACK)  │
└──────┘    └───────────┘    └─────────┘    └────────┘
Graceful Shutdown Flow:
- stopSignalCh receives signal
- Poller stops fetching → closes channel
- Processor sees closed channel → drains remaining messages
- All messages acknowledged → Exit
Core Structure
We need to track both the shutdown request and completion:
type Consumer[T any] struct {
    poller       Poller
    processor    Processor[T]
    stopSignalCh chan struct{} // Trigger graceful shutdown
    stoppedCh    chan struct{} // Signal completion
    // ... config, logging, etc
}
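The repository has its own constructor; a minimal sketch of the wiring might look like this (the cfg field and the buffered stop channel are assumptions on my part):

// NewConsumer wires up the components and the two shutdown channels.
func NewConsumer[T any](cfg Config, poller Poller, processor Processor[T]) *Consumer[T] {
    return &Consumer[T]{
        poller:    poller,
        processor: processor,
        cfg:       cfg, // assumed config field (part of "... config, logging, etc" above)
        // Buffered so the send in Close() can't block if Consume has already returned.
        stopSignalCh: make(chan struct{}, 1),
        // Closed by Consume() once the graceful drain completes; Close() waits on it.
        stoppedCh: make(chan struct{}),
    }
}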
The Main Loop: Control Tower
Here's the heart of the system:
func (c *SQSConsumer[T]) Consume(
    ctx context.Context,
    queueURL string,
    handler Handler[T],
) error {
    // Buffer size balances throughput against shutdown time
    bufferSize := c.cfg.ProcessorWorkerPoolSize * 3
    msgs := make(chan sqstypes.Message, bufferSize)

    // Separate contexts let us control shutdown independently
    processCtx, cancelProcess := context.WithCancel(ctx)
    pollerCtx, cancelPoller := context.WithCancel(ctx)

    // Cleanup always happens, whether we exit gracefully or via context cancellation
    defer func() {
        cancelPoller()
        cancelProcess()
        c.isRunning = false
        c.metrics.RecordShutdown(ctx)
    }()

    c.isRunning = true

    pollerErrCh := make(chan error, 1)
    processErrCh := make(chan error, 1)

    // Start both components
    go func() { pollerErrCh <- c.poller.Poll(pollerCtx, msgs) }()
    go func() { processErrCh <- c.processor.Process(processCtx, msgs, handler) }()

    // The control tower
    select {
    case <-ctx.Done():
        // Emergency brake—parent context canceled (e.g., timeout expired)
        return ctx.Err()
    case <-c.stopSignalCh:
        // Graceful shutdown sequence

        // PHASE 1: Stop accepting new work
        cancelPoller()

        // PHASE 2: Wait for poller to close the channel
        // CRITICAL: The poller MUST close 'msgs' when it exits
        if err := <-pollerErrCh; err != nil {
            return err
        }

        // PHASE 3: Processor drains remaining messages
        // It exits naturally when the channel closes
        if err := <-processErrCh; err != nil {
            return err
        }

        // Signal that shutdown is complete
        // This unblocks Close() which is waiting on this channel
        close(c.stoppedCh)
        return nil
    }
}
The key insight: we cancel only the poller's context (closing the door). The processor keeps running until the channel is empty and closed. The processor doesn't know or care that we're shutting down — it just keeps processing until there's no more work.
The Poller: Locking the Front Door
Remember our restaurant analogy? The poller is the host who locks the door. Its job is simple but critical — it must close the channel when it exits:
func (p *Poller) Poll(ctx context.Context, ch chan<- Message) error {
    defer close(ch) // This one line coordinates everything

    for {
        if ctx.Err() != nil {
            // Context canceled—time to stop
            return nil
        }

        // Poll for messages from the queue the poller was configured with
        messages := p.fetchMessages(ctx, p.queueURL)
        for _, msg := range messages {
            select {
            case ch <- msg:
                // Message sent to processor
            case <-ctx.Done():
                // Context canceled while sending
                return nil
            }
        }
    }
}
That defer close(ch) does the heavy lifting. When the poller's context is canceled:
- The function returns
- The defer closes the channel
- The processor's range loop exits naturally
No explicit coordination needed: a closed channel still delivers its buffered messages, so the range loop only ends once the buffer is empty.
The Processor: Finishing What's on the Stove
The processor simply ranges over the channel until it closes:
func (p *Processor[T]) Process(ctx context.Context, msgs <-chan Message, handler Handler[T]) error {
    var wg sync.WaitGroup

    // Start worker pool
    for i := 0; i < p.workerPoolSize; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // Process messages until channel closes
            for msg := range msgs {
                select {
                case <-ctx.Done():
                    // Emergency exit if timeout expires
                    return
                default:
                    p.processMessage(ctx, msg, handler)
                }
            }
        }()
    }

    // Wait for all workers to finish
    wg.Wait()
    return nil
}
The processor keeps the ctx.Done() check in case we hit the timeout and need to force-exit, but during normal graceful shutdown, it simply processes until msgs closes.
The Close Method: Initiating Shutdown
This is the public API your main.go calls when it catches SIGTERM:
func (c *Consumer[T]) Close() error {
    c.mu.Lock()
    if !c.isRunning || c.isClosing {
        c.mu.Unlock()
        return nil
    }
    c.isClosing = true
    c.stopSignalCh <- struct{}{} // Trigger graceful shutdown
    c.mu.Unlock()

    // Wait for completion, but not forever
    timeout := time.Duration(c.cfg.GracefulShutdownTimeout) * time.Second
    select {
    case <-c.stoppedCh:
        // Clean shutdown completed
        return nil
    case <-time.After(timeout):
        // Timeout expired—shutdown incomplete
        return fmt.Errorf("shutdown timeout after %v", timeout)
    }
}
The timeout is crucial. In the real world, you can't wait indefinitely: Kubernetes will eventually send SIGKILL if your Pod doesn't terminate within its termination grace period (30 seconds by default). The timeout gives you a chance to clean up, but ensures you don't hang forever.
Choosing a Timeout Value
How do you pick a good timeout?
shutdownTime = (bufferSize / workerPoolSize) × avgProcessingTime × safetyFactor
Example:
- Buffer: 30 messages
- Workers: 10
- Processing time: 100ms per message
- Safety factor: 3x
shutdownTime = (30 / 10) × 100ms × 3 = 900ms
Set timeout to 2-3x this: GracefulShutdownTimeout = 3 seconds
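Purely as an illustration (this helper is not part of the consumer's API), the same arithmetic in Go:

// estimateShutdownTimeout derives a graceful-shutdown budget from the
// worst-case drain time: (buffered messages / workers) × average
// processing time, padded by a safety factor for slow outliers.
func estimateShutdownTimeout(bufferSize, workers int, avgProcessing time.Duration, safetyFactor float64) time.Duration {
    batches := float64(bufferSize) / float64(workers)
    return time.Duration(batches * float64(avgProcessing) * safetyFactor)
}

// estimateShutdownTimeout(30, 10, 100*time.Millisecond, 3) → 900ms;
// round up to 2-3x that when setting GracefulShutdownTimeout.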
What Happens When The Timeout Expires?
If the timeout expires, Close() returns an error, but the consumer is still running. At this point, you have a few options:
- Let it go — Kubernetes will SIGKILL the Pod shortly.
- Force cancel — Cancel the parent context to trigger the emergency brake.
In production code, this looks like:
func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    consumer := NewConsumer(cfg, poller, processor)

    // Handle shutdown signals
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        <-sigCh
        log.Info("Shutdown signal received")
        if err := consumer.Close(); err != nil {
            // Decision point: what now?
            log.Error("some messages may be reprocessed", "error", err)
            cancel() // Emergency brake: force everything to stop
        }
    }()

    // This blocks until shutdown completes
    if err := consumer.Consume(ctx, queueURL, handler); err != nil {
        log.Error("Consumer error", "error", err)
    }
}
Key Takeaways
The problem: ctx.Done() can't distinguish "stop accepting work" from "finish current work." This may leave in-flight work in inconsistent states.
The solution:
- Use separate contexts for different components
- Use a dedicated signal channel for graceful shutdown
- Let channel closure coordinate the drain naturally
- Always include a timeout — you can't wait forever
When to use this pattern:
- Message queue consumers (SQS, RabbitMQ, Kafka)
- Any system with explicit acknowledgments
- Multi-step operations that must complete atomically
- Long-running batch jobs
- Non-idempotent external API calls
When you don't need it:
- Stateless operations
- Idempotent operations that can safely restart
- Pure computational tasks with no side effects
What's Next?
If you're building systems with in-flight work, consider what your "units of work" look like. What state do they maintain? What happens if they're interrupted? How long does completion typically take?
The answers to these questions will guide your shutdown strategy. Sometimes immediate exit is fine. Sometimes you need another approach. The key is making a conscious choice rather than defaulting to ctx.Done() everywhere.
The approach I've shown here is specific to message consumers, but the principle applies broadly.
For the complete implementation, check out the full code on GitHub.
And if you've implemented graceful shutdown in a different way, or have real-life stories about what happens when you don't, I'd love to discuss them.