Myroslav Vivcharyk

Posted on • Originally published at devgeist.com

When ctx.Done() Isn't Enough

As Go engineers, we've all internalized the context pattern. When an application shuts down, we propagate the cancellation, our select blocks catch <-ctx.Done(), and we return. Clean. Simple. Idiomatic.

It usually looks like this:

for {
    select {
    case <-ctx.Done():
        return // Clean exit, right?
    default:
        // Do work
    }
}

For many applications, this is "good enough." If you're running a stateless process that can safely restart mid-operation, Kubernetes will just spin up a new Pod and life goes on. A few "context canceled" errors in your logs? Who cares.

But what about processing a payment when SIGTERM arrives? Or acknowledging a message you've half-processed?

When "good enough" isn't enough anymore?

Not every application can afford to drop everything and exit. Consider what happens when you're halfway through processing a payment and SIGTERM arrives. Or when you've pulled a message from a queue but haven't acknowledged it yet.

Here are a few scenarios where naive context cancellation creates real problems:

Message queues with acknowledgments — SQS, RabbitMQ, Kafka — any system where you must explicitly confirm message processing. Unacknowledged messages get reprocessed, causing duplicates or delays.

Multi-step transactions — Payment captured but not recorded. Inventory reserved but not confirmed. Customer charged but order not created.

Non-idempotent operations — Notifications already sent. Resources already created. Webhooks already fired. You can't just "try again."

Long-running batch jobs — Hours of computation that can't resume from the middle. Data migrations left in inconsistent states.

In all these cases, the cost of interruption exceeds the cost of waiting a few extra seconds for clean completion.

If your code lives in any of these worlds, keep reading.

The Problem: One Signal, Two Jobs

Let's look at a typical message processing pipeline. I'll use SQS as an example since it clearly demonstrates the issue, but the pattern applies to any acknowledgment-based system:

Scenario: SQS Processing

Here's the typical implementation:

func ConsumeMessages(ctx context.Context, queueURL string) error {
    messages := make(chan Message, 100) // Buffered channel

    go pollMessages(ctx, queueURL, messages)

    for {
        select {
        case <-ctx.Done():
            // Problem: returns immediately!
            // Buffered messages abandoned
            // In-flight processing interrupted
            return ctx.Err()

        case msg := <-messages:
            processMessage(ctx, msg)    // ← SIGTERM during this
            deleteFromSQS(ctx, msg)     // ← Never executes
        }
    }
}

When SIGTERM arrives and ctx.Done() fires, here's what may happen:

func processMessage(ctx context.Context, msg Message) error {
    // Step 1: Call external API
    if err := callSomeAPI(ctx, msg.Data); err != nil {
        return err  // ctx.Done() could interrupt here
    }

    // Step 2: Database save interrupted by ctx.Done()
    if err := saveToDatabase(ctx, msg.Data); err != nil {
        return err  // or here
    }

    return nil
}

Since we propagate the context to processMessage(), the cancellation flows through your function calls. The external API call might complete, but the database save fails with context.Canceled. Now you have partial state — the API side effect happened, but nothing was saved. And since deleteFromSQS() never runs, the message gets reprocessed, potentially duplicating the API call.

Additionally, any messages already fetched from the queue but sitting in a buffer waiting to be processed get abandoned when the loop exits — they'll time out and reappear for reprocessing.

The root issue: We have one signal (ctx.Done()) trying to do two completely different jobs.

  1. Stop accepting new work
  2. Stop working immediately

We need to separate these concerns.

Mental Model: The Restaurant Kitchen

Imagine your system like a restaurant kitchen during closing time:

  Orders In  →  Prep Station  →  Cooking  →  Serve to Customer
   (fetch)       (buffer)       (process)     (complete)

Naive shutdown (ctx.Done()):

The manager shouts "We're closed!" and the chef drops the pan mid-stir and walks out.

  • Orders on the counter? Tossed.
  • Food on the stove? Abandoned half-cooked.
  • Customers waiting for meals? Kicked out hungry.

Good shutdown (graceful):

  1. Lock the front door — Stop taking new orders
  2. Finish what's cooking — Complete dishes on the stove
  3. Serve completed meals — Deliver to waiting customers
  4. Then turn off the lights and go home

Of course, you can't wait forever. If it's been 30 minutes and someone's still cooking, you'll eventually have to call it. But you give the kitchen a reasonable chance to finish.

This is exactly our strategy: control the shutdown signal so work can flow to completion.

The Two-Phase Shutdown

The solution is elegantly simple: use two signals instead of one.

Signal 1: "Please stop gracefully" — Stop accepting new work

Signal 2: "Emergency brake" — Force shutdown if graceful fails (via ctx.Done())

Here's the conceptual flow:

// The core idea
select {
case <-ctx.Done():
    // Emergency: parent context canceled, stop everything now
    return ctx.Err()

case <-stopSignalCh:
    // Graceful shutdown in three phases:

    // Phase 1: Stop accepting new work
    cancelPoller()

    // Phase 2: Wait for poller to finish and close the work channel
    <-pollerDone

    // Phase 3: Processor drains remaining messages from channel
    <-processorDone

    return nil
}

We're using a simple channel close as the coordination mechanism. When the poller stops, it closes the message channel. The processor, ranging over that channel, naturally drains any remaining messages and then exits. No complex synchronization needed — just Go's built-in channel semantics.
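
Before we get to the full consumer, here's a minimal, self-contained sketch of just this mechanism (toy names, not from the actual implementation): the producer owns the channel and closes it on cancellation, and the consumer keeps draining whatever is buffered before signaling that it's done.

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    work := make(chan int, 10) // buffered, like the real message channel
    done := make(chan struct{})

    // Producer ("poller"): stops on cancellation and closes the channel.
    go func() {
        defer close(work) // the only coordination we need
        for i := 0; ; i++ {
            select {
            case <-ctx.Done():
                return
            case work <- i:
                time.Sleep(10 * time.Millisecond)
            }
        }
    }()

    // Consumer ("processor"): exits only when 'work' is closed AND empty.
    go func() {
        defer close(done)
        for item := range work {
            fmt.Println("processed", item)
        }
    }()

    time.Sleep(100 * time.Millisecond)
    cancel() // "lock the front door"
    <-done   // wait for the drain to finish
    fmt.Println("all buffered work completed")
}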

Implementation: A Real Example

Let me show you how this looks in practice. I'll use a real SQS consumer implementation that I built; you can see the full code on GitHub, but I'll walk through the key parts here.

The Data Flow

Before diving into code, let's visualize what we're building:

Single Message Flow:
┌──────┐    ┌─────────┐    ┌─────────┐    ┌────────┐
│ Poll │───▶│Transform│───▶│ Process │───▶│ Delete │
│ SQS  │    │ Message │    │Handler  │    │  (ACK) │
└──────┘    └─────────┘    └─────────┘    └────────┘

Graceful Shutdown Flow:

  1. stopSignalCh receives signal
  2. Poller stops fetching → closes channel
  3. Processor sees closed channel → drains remaining messages
  4. All messages acknowledged → Exit

Core Structure

We need to track both the shutdown request and completion:

type Consumer[T any] struct {
    poller      Poller
    processor   Processor[T]

    stopSignalCh chan struct{}  // Trigger graceful shutdown
    stoppedCh    chan struct{}  // Signal completion

    // ... config, logging, etc
}
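
The snippets below also reference fields like cfg, mu, and metrics that live behind the "config, logging, etc" comment. Here's a rough constructor sketch wiring up the two channels (purely illustrative: the Config type and the buffering choice are my assumptions, and the real constructor lives in the linked repo):

// Illustrative only: a minimal constructor for the two shutdown channels.
func NewConsumer[T any](cfg Config, poller Poller, processor Processor[T]) *Consumer[T] {
    return &Consumer[T]{
        poller:    poller,
        processor: processor,
        cfg:       cfg,

        // Buffered so a call to Close() can't block forever if Consume()
        // has already returned (e.g., via ctx.Done()).
        stopSignalCh: make(chan struct{}, 1),

        // Closed by Consume() once the graceful drain completes;
        // Close() waits on it (with a timeout).
        stoppedCh: make(chan struct{}),
    }
}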

The Main Loop: Control Tower

Here's the heart of the system:

func (c *Consumer[T]) Consume(
    ctx context.Context,
    queueURL string,
    handler Handler[T],
) error {
    // Buffer size balances throughput against shutdown time
    bufferSize := c.cfg.ProcessorWorkerPoolSize * 3
    msgs := make(chan sqstypes.Message, bufferSize)

    // Separate contexts let us control shutdown independently
    processCtx, cancelProcess := context.WithCancel(ctx)
    pollerCtx, cancelPoller := context.WithCancel(ctx)

    // Cleanup always happens, whether we exit gracefully or via context cancellation
    defer func() {
        cancelPoller()
        cancelProcess()
        c.isRunning = false
        c.metrics.RecordShutdown(ctx)
    }()

    c.isRunning = true

    pollerErrCh := make(chan error, 1)
    processErrCh := make(chan error, 1)

    // Start both components
    go func() { pollerErrCh <- c.poller.Poll(pollerCtx, msgs) }()
    go func() { processErrCh <- c.processor.Process(processCtx, msgs, handler) }()

    // The control tower
    select {
    case <-ctx.Done():
        // Emergency brake—parent context canceled (e.g., timeout expired)
        return ctx.Err()

    case <-c.stopSignalCh:
        // Graceful shutdown sequence

        // PHASE 1: Stop accepting new work
        cancelPoller()

        // PHASE 2: Wait for poller to close the channel
        // CRITICAL: The poller MUST close 'msgs' when it exits
        if err := <-pollerErrCh; err != nil {
            return err
        }

        // PHASE 3: Processor drains remaining messages
        // It exits naturally when the channel closes
        if err := <-processErrCh; err != nil {
            return err
        }

        // Signal that shutdown is complete.
        // This unblocks Close(), which is waiting on this channel.
        close(c.stoppedCh)

        return nil
    }
}

The key insight: we cancel only the poller's context (closing the door). The processor keeps running until the channel is empty and closed. The processor doesn't know or care that we're shutting down — it just keeps processing until there's no more work.

The Poller: Locking the Front Door

Remember our restaurant analogy? The poller is the host who locks the door. Its job is simple but critical — it must close the channel when it exits:

func (p *Poller) Poll(ctx context.Context, ch chan<- Message) error {
    defer close(ch)  // This one line coordinates everything

    for {
        if ctx.Err() != nil {
            // Context canceled—time to stop
            return nil
        }

        // Poll a batch of messages (queueURL held as a field on the Poller)
        messages := p.fetchMessages(ctx, p.queueURL)

        for _, msg := range messages {
            select {
            case ch <- msg:
                // Message sent to processor
            case <-ctx.Done():
                // Context canceled while sending
                return nil
            }
        }
    }
}

That defer close(ch) does the heavy lifting. When the poller's context is canceled:

  1. The function returns
  2. The defer closes the channel
  3. The processor's range loop exits naturally

No explicit coordination needed.

The Processor: Finishing What's on the Stove

The processor simply ranges over the channel until it closes:

func (p *Processor[T]) Process(ctx context.Context, msgs <-chan Message, handler Handler[T]) error {
    var wg sync.WaitGroup

    // Start worker pool
    for i := 0; i < p.workerPoolSize; i++ {
        wg.Add(1)

        go func() {
            defer wg.Done()

            // Process messages until channel closes
            for msg := range msgs {
                select {
                case <-ctx.Done():
                    // Emergency exit if timeout expires
                    return
                default:
                    p.processMessage(ctx, msg, handler)
                }
            }
        }()
    }

    // Wait for all workers to finish
    wg.Wait()

    return nil
}

The processor keeps the ctx.Done() check in case we hit the timeout and need to force-exit, but during normal graceful shutdown, it simply processes until msgs closes.

The Close Method: Initiating Shutdown

This is the public API your main.go calls when it catches SIGTERM:

func (c *Consumer[T]) Close() error {
    c.mu.Lock()

    if !c.isRunning || c.isClosing {
        c.mu.Unlock()
        return nil
    }

    c.isClosing = true
    c.stopSignalCh <- struct{}{}  // Trigger graceful shutdown
    c.mu.Unlock()

    // Wait for completion, but not forever
    timeout := time.Duration(c.cfg.GracefulShutdownTimeout) * time.Second

    select {
    case <-c.stoppedCh:
        // Clean shutdown completed
        return nil

    case <-time.After(timeout):
        // Timeout expired—shutdown incomplete
        return fmt.Errorf("shutdown timeout after %v", timeout)
    }
}

The timeout is crucial. In the real world, you can't wait indefinitely. Kubernetes will eventually send SIGKILL if your Pod doesn't terminate. The timeout gives you a chance to clean up, but ensures you don't hang forever.

Choosing a Timeout Value

How do you pick a good timeout?

shutdownTime = (bufferSize / workerPoolSize) × avgProcessingTime × safetyFactor

Example:

  • Buffer: 30 messages
  • Workers: 10
  • Processing time: 100ms per message
  • Safety factor: 3x

shutdownTime = (30 / 10) × 100ms × 3 = 900ms

Set timeout to 2-3x this: GracefulShutdownTimeout = 3 seconds
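
If you'd rather compute this in code than on a napkin, a tiny helper like the one below (illustrative names, assuming you already know the buffer size, worker count, and average processing time from your config and metrics) applies the same formula:

// estimateShutdownTimeout applies the formula above:
// (bufferSize / workers) × avgProcessing × safetyFactor, then ~3x headroom.
func estimateShutdownTimeout(bufferSize, workers int, avgProcessing time.Duration, safetyFactor float64) time.Duration {
    drain := time.Duration(float64(bufferSize) / float64(workers) * float64(avgProcessing) * safetyFactor)
    return 3 * drain
}

// estimateShutdownTimeout(30, 10, 100*time.Millisecond, 3)
// → drain ≈ 900ms, returned timeout ≈ 2.7s; round up to GracefulShutdownTimeout = 3s.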

What Happens When The Timeout Expires?

If the timeout expires, Close() returns an error, but the consumer is still running. At this point, you have a few options:

  1. Let it go — Kubernetes will SIGKILL the Pod shortly.

  2. Force cancel — Cancel the parent context to trigger the emergency brake.

In production code, this looks like:

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    consumer := NewConsumer(cfg, poller, processor)

    // Handle shutdown signals
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    go func() {
        <-sigCh
        log.Info("Shutdown signal received")

        if err := consumer.Close(); err != nil {
            // Decision point: what now?
            log.Error("Graceful shutdown incomplete; some messages may be reprocessed", "error", err)
            cancel()  // Emergency brake: force everything to stop
        }
    }()

    // This blocks until shutdown completes (queueURL and handler defined elsewhere)
    if err := consumer.Consume(ctx, queueURL, handler); err != nil {
        log.Error("Consumer error", "error", err)
    }
}

Key Takeaways

The problem: a single ctx.Done() signal can't distinguish "stop accepting new work" from "abandon in-flight work immediately." This can leave in-flight work in inconsistent states.

The solution:

  • Use separate contexts for different components
  • Use a dedicated signal channel for graceful shutdown
  • Let channel closure coordinate the drain naturally
  • Always include a timeout — you can't wait forever

When to use this pattern:

  • Message queue consumers (SQS, RabbitMQ, Kafka)
  • Any system with explicit acknowledgments
  • Multi-step operations that must complete atomically
  • Long-running batch jobs
  • Non-idempotent external API calls

When you don't need it:

  • Stateless operations
  • Idempotent operations that can safely restart
  • Pure computational tasks with no side effects

What's Next?

If you're building systems with in-flight work, consider what your "units of work" look like. What state do they maintain? What happens if they're interrupted? How long does completion typically take?

The answers to these questions will guide your shutdown strategy. Sometimes immediate exit is fine. Sometimes you need another approach. The key is making a conscious choice rather than defaulting to ctx.Done() everywhere.

The approach I've shown here is specific to message consumers, but the principle applies broadly.

For the complete implementation, check out the full code on GitHub.

And if you've implemented graceful shutdown in a different way, or have real-life stories about what happens when you don't, I'd love to discuss them.
