Kahiro Okina

Kubernetes Actions Runner Controller: A Thorough Explanation of PR #4059 Enabling Self-Healing for EphemeralRunner

1. Background ― Why was this PR necessary?

| Symptom | Previous Implementation |
| --- | --- |
| Spot instances terminate when their time is up, and EphemeralRunner mistakenly judges this as a failure | Failures accumulated one by one in EphemeralRunner's status.failures |
| When using Spot instances with minRunners ≥ 1, Spot instance interruptions are unavoidable | After 5 failures (interruptions), the Runner CR remained stuck with Phase=Failed |
| From the workflow side, jobs stall in a "Runner not found" state | Reported about 2 years ago in Issue #2721 |

2. Previous Limitations and Challenges

  • A constant failure limit of 5 already existed before this PR
  • After reaching the limit, the controller only changed the Phase via markAsFailed(), so:
    • EphemeralRunnerSet concluded "Desired=1, but the object already exists"
    • It couldn't generate a new EphemeralRunner, and the runner became "zombified" (see the sketch below)
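
To make the "zombie" failure mode concrete, here is a minimal, hypothetical sketch of the pre-#4059 flow described above. Only the name markAsFailed comes from the original code path; everything else is illustrative, not the actual controller code.

```go
package main

import "fmt"

const maxFailures = 5

// reconcileOld sketches the pre-#4059 behavior: once the limit is exceeded,
// the CR is only marked Failed. It keeps existing, so the parent
// EphemeralRunnerSet sees "Desired=1, object present" and never replaces it.
func reconcileOld(failures int, markAsFailed func()) {
	if failures > maxFailures {
		markAsFailed() // Phase=Failed, but the CR is never deleted -> "zombie"
		return
	}
	fmt.Println("below the limit: recreate the runner pod and retry")
}

func main() {
	reconcileOld(6, func() { fmt.Println("Phase=Failed (CR still exists)") })
}
```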

3. What Changed with PR #4059?

| Change | Role |
| --- | --- |
| Type change for status.failures: map[string]bool → map[string]metav1.Time | Records the failure time so that the latest failures can be tracked chronologically |
| Introduction of the failedRunnerBackoff array | Exponential backoff: 0 s → 5 s → 10 s → 20 s → 40 s → 80 s |
| Added LastFailure() helper | Retrieves the most recent failure time from Status |
| Requeue control | For failure count n ≤ 5, requeues with the equivalent of RequeueAfter(backoff[n]) |
| Changed handling of limit exceedance | Instead of changing the Phase, immediately Delete() the EphemeralRunner CR; EphemeralRunnerSet can then regenerate a new one |

💡 Point
The failure limit of 5 itself hasn't changed.
What changed is what happens after the limit is reached: instead of being left with Phase=Failed, the object is deleted,
which makes self-healing of the EphemeralRunner possible.


4. Actual Behavior in Spot × minRunners Scenario

```mermaid
sequenceDiagram
    autonumber
    participant ARC as ARC Controller
    participant ER as EphemeralRunner CR
    participant Set as EphemeralRunnerSet
    participant Node as Spot Node

    Node-->>ER: Interruption warning (2 minutes in advance)
    Node--xER: Pod Terminated
    ARC->>ER: failures+=1 / RequeueAfter(backoff[1])
    loop Repeats up to 5 times
        Node--xER: Terminated again
        ARC->>ER: failures+=1 / Backoff
    end
    ARC->>ER: failures=6 (>5)<br/>Delete(ER)
    Set->>ARC: Desired=1 but no CR
    Set-->>ARC: Create new EphemeralRunner
```


  • On the 6th failure, the CR is deleted and a new one is created, so no "zombies" remain

5. Code Explanation

1. Status Field for Recording Failures

```go
type EphemeralRunnerStatus struct {
    // ...existing fields
    Failures map[string]metav1.Time `json:"failures,omitempty"`
}

// LastFailure scans the map and returns the latest timestamp
// (body sketched here from that description; see the PR for the exact code).
func (s *EphemeralRunnerStatus) LastFailure() *metav1.Time {
    var last *metav1.Time
    for _, ts := range s.Failures {
        if last == nil || ts.After(last.Time) {
            t := ts
            last = &t
        }
    }
    return last
}
```
  • Key: Pod's .metadata.uid
  • Value: Time when the failure occurred
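
For illustration, here is a minimal, self-contained sketch of how a failure could be recorded under the pod's UID. recordFailure is a hypothetical helper, not code from the PR; only the Failures field mirrors the status shown above.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EphemeralRunnerStatus mirrors the status field shown above.
type EphemeralRunnerStatus struct {
	Failures map[string]metav1.Time `json:"failures,omitempty"`
}

// recordFailure is a hypothetical helper: it stores the failure time
// keyed by the failed pod's .metadata.uid.
func recordFailure(status *EphemeralRunnerStatus, podUID string) {
	if status.Failures == nil {
		status.Failures = map[string]metav1.Time{}
	}
	status.Failures[podUID] = metav1.Now()
}

func main() {
	st := &EphemeralRunnerStatus{}
	recordFailure(st, "a1b2c3-uid-of-terminated-pod")
	fmt.Printf("failures so far: %d, last at %v\n",
		len(st.Failures), st.LastFailureIsNotNeededHere())
}

// LastFailureIsNotNeededHere is a stand-in accessor for the demo output.
func (s *EphemeralRunnerStatus) LastFailureIsNotNeededHere() metav1.Time {
	return s.Failures["a1b2c3-uid-of-terminated-pod"]
}
```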

2. Exponential Backoff & Requeue

```go
var failedRunnerBackoff = []time.Duration{
    0, 5 * time.Second, 10 * time.Second,
    20 * time.Second, 40 * time.Second, 80 * time.Second,
}

const maxFailures = 5
```

```go
if len(er.Status.Failures) <= maxFailures {
    delay := failedRunnerBackoff[len(er.Status.Failures)]
    return ctrl.Result{RequeueAfter: delay}, nil
}
```
  • Retries with delays of 0 s → 5 s → 10 s → ...
  • On the 6th failure (len > maxFailures), it falls through to the next section (a short illustration follows)
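
As a quick sanity check of the schedule, this small self-contained Go snippet (illustration only, not ARC code) prints the delay applied after each failure count, using the same array and limit as above:

```go
package main

import (
	"fmt"
	"time"
)

// Same backoff schedule and limit as shown above.
var failedRunnerBackoff = []time.Duration{
	0, 5 * time.Second, 10 * time.Second,
	20 * time.Second, 40 * time.Second, 80 * time.Second,
}

const maxFailures = 5

func main() {
	for failures := 1; failures <= maxFailures; failures++ {
		// Index by the current failure count, exactly like the controller snippet.
		fmt.Printf("failure %d -> requeue after %s\n", failures, failedRunnerBackoff[failures])
	}
	fmt.Println("failure 6 -> exceeds maxFailures, so the EphemeralRunner CR is deleted")
}
```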

3. Immediate Deletion on 6th Time → EphemeralRunnerSet Recreates

```go
// Final phase
if len(er.Status.Failures) > maxFailures {
    log.Info("Maximum failures reached – deleting EphemeralRunner")
    if err := r.Delete(ctx, er); err != nil { ... }
    return ctrl.Result{}, nil
}
```
  • Because the object itself is deleted, the controller's retry loop ends
  • The parent EphemeralRunnerSet then generates a new EphemeralRunner to satisfy the Desired count → new Pod → the job resumes (a simplified sketch follows)
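
To show the recreation step in isolation, here is a heavily simplified, hypothetical sketch of a parent reconciler topping up to its desired count after a child CR disappears. This is not the actual EphemeralRunnerSet code, just the idea behind it.

```go
package main

import "fmt"

// scaleUpToDesired creates replacements until the desired count is met again,
// which is conceptually what the parent resource does after a child CR is deleted.
func scaleUpToDesired(desired int, existing []string, create func(i int) string) []string {
	for i := len(existing); i < desired; i++ {
		existing = append(existing, create(i))
	}
	return existing
}

func main() {
	// After the failed EphemeralRunner was deleted, no CR exists but Desired=1.
	existing := []string{}
	existing = scaleUpToDesired(1, existing, func(i int) string {
		return fmt.Sprintf("ephemeral-runner-%d (new)", i)
	})
	fmt.Println(existing) // a fresh EphemeralRunner -> new Pod -> the queued job resumes
}
```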

Operation Flow Summary

```mermaid
graph TD
    A[Pod Failed] --> B{"Failures ≤ 5?"}
    B -- Yes --> C["Record failure & requeue after backoff"]
    B -- No --> D[Delete EphemeralRunner CR]
    D --> E[EphemeralRunnerSet creates new EphemeralRunner]
```



6. Conclusion

PR #4059 solves the long-standing issue (Issue #2721) of "getting stuck after 5 interruptions" that's particularly frequent in Spot × minRunners environments.

  • Manages failures chronologically and retries with exponential backoff
  • Deletes the EphemeralRunner CR upon exceeding the limit, allowing EphemeralRunnerSet to recreate it
  • This brings self-healing close to a fully managed experience
  • However, the risk of job execution being disrupted by Spot interruptions still remains, so caution is needed when using Spot instances
