1. Background ― Why was this PR necessary?
| Symptom | Previous Implementation |
| --- | --- |
| Spot instances terminate when their time is up, and EphemeralRunner mistakenly judges this as a failure. | Failures accumulated one by one in EphemeralRunner's `status.failures`. |
| When using Spot instances with `minRunners ≥ 1`, you inevitably encounter Spot instance interruptions. | After 5 failures (interruptions), the Runner CR remains with `Phase=Failed`. |
| From the Workflow side, jobs stall in a "Runner not found" state. | Reported about 2 years ago in Issue #2721 |
2. Previous Limitations and Challenges
- A constant failure limit of 5 already existed.
- After reaching the limit, the controller only changed the Phase via `markAsFailed()`, so:
  - EphemeralRunnerSet determined "Desired=1, but the object already exists"
  - It couldn't create a new EphemeralRunner, and the runner became "zombified"
3. What Changed with PR #4059?
| Change | Role |
| --- | --- |
| Type change for `status.failures`: `map[string]bool` → `map[string]metav1.Time` | Records the failure time to track the latest failures chronologically |
| Introduction of the `failedRunnerBackoff` array | Exponential backoff: 0 s → 5 s → 10 s → 20 s → 40 s → 80 s (GitHub) |
| Added `LastFailure()` helper | Retrieves the most recent failure time from Status (GitHub) |
| Requeue control | For failure count n ≤ 5, processing equivalent to `RequeueAfter(backoff[n])` (GitHub) |
| Changed handling of limit exceedance | Instead of changing Phase, immediately `Delete()` the EphemeralRunner CR. EphemeralRunnerSet can then regenerate a new one (GitHub) |
💡 Point
The constant limit of 5 failures itself hasn't changed.
What happens after reaching the limit changed from setting Phase=Failed to deleting the object,
making self-healing of the EphemeralRunner possible.
4. Actual Behavior in Spot × minRunners Scenario
```mermaid
sequenceDiagram
    autonumber
    participant ARC as ARC Controller
    participant ER as EphemeralRunner CR
    participant Set as EphemeralRunnerSet
    participant Node as Spot Node
    Node-->>ER: Interruption warning (2 minutes in advance)
    Node--x ER: Pod Terminated
    ARC->>ER: failures+=1 / RequeueAfter(backoff[1])
    loop Repeats up to 5 times
        Node--x ER: Terminated again
        ARC->>ER: failures+=1 / Backoff
    end
    ARC->>ER: failures=6 (>5)<br/>Delete(ER)
    Set->>ARC: Desired=1 but no CR
    Set-->>ARC: Create new EphemeralRunner
```
- On the 6th failure, the CR is deleted and regenerated, so no "zombies" remain
5. Code Explanation
1. Status Field for Recording Failures
```go
type EphemeralRunnerStatus struct {
	// ...existing fields
	Failures map[string]metav1.Time `json:"failures,omitempty"`
}

func (s *EphemeralRunnerStatus) LastFailure() *metav1.Time {
	// Helper that scans the map and returns the latest timestamp
}
```
- Key: Pod's .metadata.uid
- Value: Time when the failure occurred
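
The helper's body is elided above. As a rough illustration only, here is a minimal, self-contained sketch of what such a `LastFailure()` could look like given the field shape shown above; the package name is illustrative and the actual code in the PR may differ:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EphemeralRunnerStatus mirrors the field shape described above.
type EphemeralRunnerStatus struct {
	// Failures maps a failed Pod's .metadata.uid to the time the failure was recorded.
	Failures map[string]metav1.Time `json:"failures,omitempty"`
}

// LastFailure returns the most recent failure time recorded in Failures,
// or nil if no failure has been recorded yet.
func (s *EphemeralRunnerStatus) LastFailure() *metav1.Time {
	var latest *metav1.Time
	for _, t := range s.Failures {
		t := t // take the address of a copy, not the loop variable
		if latest == nil || t.After(latest.Time) {
			latest = &t
		}
	}
	return latest
}
```

With something like this, the controller can read the timestamp of the most recent interruption, which is exactly the chronological tracking the type change is meant to enable.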
2. Exponential Backoff & Requeue
```go
var failedRunnerBackoff = []time.Duration{
	0, 5 * time.Second, 10 * time.Second,
	20 * time.Second, 40 * time.Second, 80 * time.Second,
}

const maxFailures = 5

if len(er.Status.Failures) <= maxFailures {
	delay := failedRunnerBackoff[len(er.Status.Failures)]
	return ctrl.Result{RequeueAfter: delay}, nil
}
```
- Retries with 0 seconds → 5 s → 10 s → ...
- On the 6th attempt (`len > maxFailures`), it proceeds to the next section
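
To make the indexing concrete, here is a small standalone example (not part of the PR) that prints the delay picked for each failure count using the same backoff table:

```go
package main

import (
	"fmt"
	"time"
)

// Same backoff table and limit as shown above.
var failedRunnerBackoff = []time.Duration{
	0, 5 * time.Second, 10 * time.Second,
	20 * time.Second, 40 * time.Second, 80 * time.Second,
}

const maxFailures = 5

func main() {
	for n := 0; n <= maxFailures+1; n++ {
		if n > maxFailures {
			// 6th failure and beyond: no more requeueing, the runner is deleted instead.
			fmt.Printf("failures=%d: limit exceeded -> delete the EphemeralRunner\n", n)
			continue
		}
		fmt.Printf("failures=%d: requeue after %s\n", n, failedRunnerBackoff[n])
	}
}
```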
3. Immediate Deletion on the 6th Failure → EphemeralRunnerSet Recreates
```go
// Final phase
if len(er.Status.Failures) > maxFailures {
	log.Info("Maximum failures reached – deleting EphemeralRunner")
	if err := r.Delete(ctx, er); err != nil { ... }
	return ctrl.Result{}, nil
}
```
- Since the object itself is deleted, the controller's retries end there
- The parent resource EphemeralRunnerSet creates a new EphemeralRunner to meet the Desired count → Pod → the job resumes (see the sketch below)
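
As a rough illustration of why the deletion matters, here is a self-contained toy model of the "create enough runners to meet Desired" idea; the real EphemeralRunnerSet reconciler is of course more involved, so treat the function and its name as hypothetical:

```go
package main

import "fmt"

// runnersToCreate is a toy model of the "meet the Desired count" check:
// replacements are only created for runners that no longer exist.
func runnersToCreate(desired, existing int) int {
	if missing := desired - existing; missing > 0 {
		return missing
	}
	return 0
}

func main() {
	// Old behavior: the Phase=Failed runner still existed, so nothing was created.
	fmt.Println(runnersToCreate(1, 1)) // 0 -> "zombie" state

	// New behavior: the failed runner is deleted, so a replacement is created.
	fmt.Println(runnersToCreate(1, 0)) // 1 -> new EphemeralRunner, Pod, job resumes
}
```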
Operation Flow Summary
```mermaid
graph TD
    A[Pod Failed] --> B{"Failures <= 5?"}
    B -- Yes --> C["Record failure & requeue after backoff"]
    B -- No --> D[Delete EphemeralRunner CR]
    D --> E[EphemeralRunnerSet creates new EphemeralRunner]
```
6. Conclusion
PR #4059 solves the long-standing issue (Issue #2721) of "getting stuck after 5 interruptions" that's particularly frequent in Spot × minRunners environments.
- Manages failures chronologically and retries with exponential backoff
- Deletes the EphemeralRunner CR upon exceeding the limit, allowing EphemeralRunnerSet to recreate it
- This achieves self-healing close to a fully managed experience
- However, the risk of job execution being disrupted by Spot interruptions still remains, so caution is needed when using Spot instances