DEV Community

James Lee
James Lee

Posted on

client-go Deep Dive: WorkQueue — The Reliable Task Queue for Kubernetes Controllers

In the previous article we saw how Indexer stores and indexes resource objects for fast local queries. Now we look at the other half of the controller pattern: WorkQueue — the component that bridges event callbacks and controller reconciliation workers.

Source: k8s.io/client-go/util/workqueue/


1. Where WorkQueue Fits

informer.AddEventHandler(ResourceEventHandlerFuncs{
    AddFunc:    func(obj) { queue.Add(keyOf(obj)) },
    UpdateFunc: func(old, new) { queue.Add(keyOf(new)) },
    DeleteFunc: func(obj) { queue.Add(keyOf(obj)) },
})
      
        event fires  key enqueued
      
┌─────────────────────────────────────────────────┐
                  WorkQueue                      
  - deduplicates keys                            
  - rate limits re-enqueues on error             
  - supports multiple concurrent workers         
└─────────────────────────────────────────────────┘
      
        worker goroutine calls queue.Get()
      
Controller reconciliation logic
├── success  queue.Done(key)
└── error    queue.AddRateLimited(key)   retry with backoff
Enter fullscreen mode Exit fullscreen mode

Key pattern: Event handlers only enqueue the key (e.g. "default/nginx"), never the full object. The worker fetches the latest object from Indexer at processing time — ensuring it always works with the most current state.


2. WorkQueue vs Plain FIFO

WorkQueue is not a simple queue. It adds critical properties that make it production-safe for controller development:

Property Plain FIFO WorkQueue
Ordered (FIFO)
Concurrent producers & consumers
Deduplication
Rate limiting on retry
Prometheus metrics
Graceful shutdown (SIGTERM)

Deduplication in detail

Event burst: Pod "nginx" Updated 5 times in 100ms

Plain FIFO:   [nginx, nginx, nginx, nginx, nginx]  ← processed 5 times
WorkQueue:    [nginx]                               ← processed once  ✅

Rule: if a key is already in the queue (not yet processed),
      subsequent Add() calls for the same key are silently dropped.
      The key is only re-eligible after Done() is called.
Enter fullscreen mode Exit fullscreen mode

3. Three Queue Types

client-go ships three queue implementations, each building on the previous:

┌─────────────────────────────────────────────────────────┐
│  Type 1: FIFO Queue                                     │
│  k8s.io/client-go/util/workqueue/queue.go               │
│  Basic ordered queue with deduplication                 │
└─────────────────────────────┬───────────────────────────┘
                              │ extends
┌─────────────────────────────▼───────────────────────────┐
│  Type 2: Delaying Queue                                 │
│  k8s.io/client-go/util/workqueue/delaying_queue.go      │
│  Adds: AddAfter(item, duration) — insert after a delay  │
└─────────────────────────────┬───────────────────────────┘
                              │ extends
┌─────────────────────────────▼───────────────────────────┐
│  Type 3: RateLimiting Queue  ← used in custom controllers│
│  k8s.io/client-go/util/workqueue/rate_limiting_queue.go  │
│  Adds: AddRateLimited, Forget, NumRequeues               │
│  Uses a RateLimiter to compute delay before re-insert    │
└─────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

In practice, custom controllers always use the RateLimiting Queue.


4. The RateLimiting Queue Interface

// The rate limiter: decides HOW LONG to delay a re-enqueue
type RateLimiter interface {
    When(item interface{}) time.Duration  // return delay for this item
    Forget(item interface{})              // reset the item's failure history
    NumRequeues(item interface{}) int     // how many times has this item failed
}

// The queue interface: wraps DelayingInterface + rate limiting methods
type RateLimitingInterface interface {
    DelayingInterface                     // inherits Add, Get, Done, Len, etc.
    AddRateLimited(item interface{})      // enqueue with rate-limited delay
    Forget(item interface{})              // clear failure history for this item
    NumRequeues(item interface{}) int     // query failure count
}
Enter fullscreen mode Exit fullscreen mode

Typical controller worker pattern

func (c *Controller) runWorker() {
    for c.processNextItem() {}
}

func (c *Controller) processNextItem() bool {
    // Block until an item is available
    key, quit := c.queue.Get()
    if quit {
        return false
    }
    defer c.queue.Done(key)   // MUST call Done() when finished

    err := c.reconcile(key.(string))
    if err == nil {
        // Success: clear the failure history
        c.queue.Forget(key)
        return true
    }

    if c.queue.NumRequeues(key) < maxRetries {
        // Transient error: re-enqueue with rate-limited delay
        c.queue.AddRateLimited(key)
        return true
    }

    // Too many retries: give up and clear history
    c.queue.Forget(key)
    utilruntime.HandleError(err)
    return true
}
Enter fullscreen mode Exit fullscreen mode

5. Three Rate-Limiting Algorithms

Algorithm 1: Exponential Backoff (ItemExponentialFailureRateLimiter)

When the same item fails repeatedly, the delay grows exponentially with each failure.

Failure count → delay:
1st failure:  base delay × 2^0  =  5ms
2nd failure:  base delay × 2^1  = 10ms
3rd failure:  base delay × 2^2  = 20ms
4th failure:  base delay × 2^3  = 40ms
...
nth failure:  min(base × 2^(n-1), maxDelay)   ← capped at maxDelay
Enter fullscreen mode Exit fullscreen mode
// Default values used in most controllers
workqueue.NewItemExponentialFailureRateLimiter(
    5*time.Millisecond,   // base delay
    1000*time.Second,     // max delay cap
)
Enter fullscreen mode Exit fullscreen mode

Rate-limiting period: starts at AddRateLimited(), ends at Forget().

Timeline:
t=0ms    AddRateLimited("nginx")  → delay 5ms  → enqueued at t=5ms
t=5ms    Get("nginx") → reconcile → fails
t=5ms    AddRateLimited("nginx")  → delay 10ms → enqueued at t=15ms
t=15ms   Get("nginx") → reconcile → fails
t=15ms   AddRateLimited("nginx")  → delay 20ms → enqueued at t=35ms
...
t=Xms    Get("nginx") → reconcile → SUCCESS
         Forget("nginx")          → failure count reset to 0
Enter fullscreen mode Exit fullscreen mode

Real-world example: Kubernetes Pod restart policy Always uses exponential backoff — that's why a crash-looping Pod shows increasing restart intervals (the familiar CrashLoopBackOff state).


Algorithm 2: Token Bucket (BucketRateLimiter)

The token bucket algorithm controls the overall throughput of the queue, regardless of which specific items are being enqueued.

┌─────────────────────────────────────────────────────────┐
│                    Token Bucket                         │
│                                                         │
│  ┌──┬──┬──┬──┬──┐                                       │
│  │🪙│🪙│🪙│🪙│🪙│  ← bucket (fixed capacity)            │
│  └──┴──┴──┴──┴──┘                                       │
│        ▲                    ▲                           │
│  tokens refilled        bucket full:                    │
│  at fixed rate r/s      stop refilling                  │
│                                                         │
│  Each AddRateLimited() consumes one token               │
│  No token available → item must wait                    │
└─────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode
// Uses golang.org/x/time/rate under the hood
workqueue.NewMaxOfRateLimiter(
    workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
    &workqueue.BucketRateLimiter{
        Limiter: rate.NewLimiter(rate.Limit(10), 100),
        // 10 tokens/second refill rate, bucket capacity 100
    },
)
Enter fullscreen mode Exit fullscreen mode

How it works:

  • Tokens are added to the bucket at a fixed rate (e.g. 10/sec)
  • Bucket has a maximum capacity — once full, new tokens are discarded
  • Each AddRateLimited() call requires one token
  • If no token is available, the item is delayed until a token is refilled

Used in: Nginx rate limiting, Envoy Ratelimit, Sentinel — token bucket is one of the most widely adopted rate-limiting algorithms in distributed systems.


Algorithm 3: Counter with Fast/Slow Lanes (ItemFastSlowRateLimiter)

The counter algorithm limits how many items can be enqueued within a time window. WorkQueue extends the basic counter with two insertion rates: a fast lane and a slow lane.

Configuration:
  fastDelay       = 5ms      ← interval between items in fast lane
  slowDelay       = 1000ms   ← interval between items in slow lane
  maxFastAttempts = 3        ← how many failures use fast lane before switching

Failure timeline for one item:
┌──────────────────────────────────────────────────────────┐
│  Attempt 1 → delay 5ms    (fast lane, attempt 1/3)       │
│  Attempt 2 → delay 5ms    (fast lane, attempt 2/3)       │
│  Attempt 3 → delay 5ms    (fast lane, attempt 3/3)       │
│  Attempt 4 → delay 1000ms (slow lane — maxFastAttempts   │
│                             exceeded)                    │
│  Attempt 5 → delay 1000ms (slow lane)                    │
│  ...                                                     │
└──────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode
workqueue.NewItemFastSlowRateLimiter(
    5*time.Millisecond,   // fastDelay
    1000*time.Millisecond,// slowDelay
    3,                    // maxFastAttempts
)
Enter fullscreen mode Exit fullscreen mode

Use case: Suitable for scenarios where a few quick retries are acceptable (transient network blips), but sustained failures should back off aggressively to avoid resource exhaustion.


6. Choosing a Rate Limiter

Algorithm Best for Behavior
Exponential Backoff Per-item retry (most common) Delay doubles with each failure; resets on success
Token Bucket Global throughput cap Smooth rate limit across all items; burst-friendly
Fast/Slow Counter Tiered retry strategy Quick retries first, then aggressive backoff

In practice, the default rate limiter used by controller-runtime and most Kubernetes controllers combines exponential backoff and token bucket via NewMaxOfRateLimiter — taking the maximum delay from both algorithms:

// Default: use whichever algorithm gives the LONGER delay
// This ensures both per-item backoff AND global rate cap are respected
workqueue.DefaultControllerRateLimiter()
// = NewMaxOfRateLimiter(
//     NewItemExponentialFailureRateLimiter(5ms, 1000s),
//     &BucketRateLimiter{rate.NewLimiter(rate.Limit(10), 100)},
//   )
Enter fullscreen mode Exit fullscreen mode

7. Summary

WorkQueue lifecycle in a controller:

Event fires (Add/Update/Delete)
     │
     ▼
queue.Add(key)          ← deduplicated: same key added once
     │
     ▼
worker: queue.Get(key)  ← blocks until item available
     │
     ├── reconcile OK  → queue.Forget(key) + queue.Done(key)
     │
     └── reconcile ERR → queue.AddRateLimited(key)
                            │
                            └── RateLimiter.When(key) → delay
                                AddAfter(key, delay)  → DelayingQueue
                                → re-enters queue after delay
Enter fullscreen mode Exit fullscreen mode
Component Role
FIFO Queue Base: ordered, deduplicated, concurrent-safe
Delaying Queue Adds AddAfter(item, duration) for time-delayed insertion
RateLimiting Queue Adds AddRateLimited — computes delay via RateLimiter before calling AddAfter
Exponential Backoff Per-item delay that doubles on each failure — prevents hammering a broken resource
Token Bucket Global throughput cap — prevents queue from overwhelming the API Server
Fast/Slow Counter Tiered retry — fast initial retries, slow sustained retries

WorkQueue is the safety net of every Kubernetes controller. Its combination of deduplication, rate limiting, and retry semantics ensures that controllers converge to desired state reliably — even under API Server failures, network partitions, or event storms.


Next in this series: EventBroadcaster: How Kubernetes Records and Distributes Cluster Events (Part 6)


Follow the series for more deep dives into Kubernetes development.

Top comments (0)