1. Background ― Why was this PR necessary?
| Symptom | Previous Implementation |
| --- | --- |
| Spot instances terminate when their time is up, and EphemeralRunner mistakenly judges this as a failure. | Failures accumulated one by one in EphemeralRunner's `status.failures`. |
| When using Spot instances with `minRunners ≥ 1`, you inevitably encounter Spot instance interruptions. | After 5 failures (interruptions), the Runner CR remains with `Phase=Failed`. |
| From the Workflow side, jobs stall in a "Runner not found" state. | Reported about 2 years ago in Issue #2721 |
2. Previous Limitations and Challenges
- A constant failure limit of 5 already existed.
- After reaching the limit, the controller only changed the Phase via `markAsFailed()`, so:
  - EphemeralRunnerSet determined "Desired=1, but the object already exists"
  - It couldn't create a new EphemeralRunner, and the runner became "zombified"
3. What Changed with PR #4059?
| Change | Role |
| --- | --- |
| Type change for `status.failures`: `map[string]bool` → `map[string]metav1.Time` | Records the failure time to track the latest failures chronologically |
| Introduction of the `failedRunnerBackoff` array | Exponential backoff: 0 s → 5 s → 10 s → 20 s → 40 s → 80 s (GitHub) |
| Added `LastFailure()` helper | Retrieves the most recent failure time from Status (GitHub) |
| Requeue control | For failure count n ≤ 5, processing equivalent to `RequeueAfter(backoff[n])` (GitHub) |
| Changed handling of limit exceedance | Instead of changing Phase, immediately `Delete()` the EphemeralRunner CR. EphemeralRunnerSet can then regenerate a new one (GitHub) |
💡 Point
The constant limit of 5 failures itself hasn't changed.
What happens after reaching the limit changed from setting Phase=Failed to deleting the object,
making self-healing of the EphemeralRunner possible.
4. Actual Behavior in Spot × minRunners Scenario
```mermaid
sequenceDiagram
    autonumber
    participant ARC as ARC Controller
    participant ER as EphemeralRunner CR
    participant Set as EphemeralRunnerSet
    participant Node as Spot Node
    Node-->>ER: Interruption warning (2 minutes in advance)
    Node--x ER: Pod Terminated
    ARC->>ER: failures+=1 / RequeueAfter(backoff[1])
    loop Repeats up to 5 times
        Node--x ER: Terminated again
        ARC->>ER: failures+=1 / Backoff
    end
    ARC->>ER: failures=6 (>5)<br/>Delete(ER)
    Set->>ARC: Desired=1 but no CR
    Set-->>ARC: Create new EphemeralRunner
```
- On the 6th failure, the CR is deleted and regenerated, so no "zombies" remain
5. Code Explanation
1. Status Field for Recording Failures
```go
type EphemeralRunnerStatus struct {
	// ...existing fields
	Failures map[string]metav1.Time `json:"failures,omitempty"`
}

func (s *EphemeralRunnerStatus) LastFailure() *metav1.Time {
	// Helper that scans the map and returns the latest timestamp
}
```
- Key: Pod's .metadata.uid
- Value: Time when the failure occurred
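
The helper's body is elided above. As a rough illustration only, here is a minimal, self-contained sketch of what such a `LastFailure()` could look like given the field shape shown above; the package name is illustrative and the actual code in the PR may differ:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// EphemeralRunnerStatus mirrors the field shape described above.
type EphemeralRunnerStatus struct {
	// Failures maps a failed Pod's .metadata.uid to the time the failure was recorded.
	Failures map[string]metav1.Time `json:"failures,omitempty"`
}

// LastFailure returns the most recent failure time recorded in Failures,
// or nil if no failure has been recorded yet.
func (s *EphemeralRunnerStatus) LastFailure() *metav1.Time {
	var latest *metav1.Time
	for _, t := range s.Failures {
		t := t // take the address of a copy, not the loop variable
		if latest == nil || t.After(latest.Time) {
			latest = &t
		}
	}
	return latest
}
```

With something like this, the controller can read the timestamp of the most recent interruption, which is exactly the chronological tracking the type change is meant to enable.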
2. Exponential Backoff & Requeue
```go
var failedRunnerBackoff = []time.Duration{
	0, 5 * time.Second, 10 * time.Second,
	20 * time.Second, 40 * time.Second, 80 * time.Second,
}

const maxFailures = 5

if len(er.Status.Failures) <= maxFailures {
	delay := failedRunnerBackoff[len(er.Status.Failures)]
	return ctrl.Result{RequeueAfter: delay}, nil
}
```
- Retries with 0 seconds → 5 s → 10 s → ...
- On the 6th attempt (`len > maxFailures`), it proceeds to the next section
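
To make the indexing concrete, here is a small standalone example (not part of the PR) that prints the delay picked for each failure count using the same backoff table:

```go
package main

import (
	"fmt"
	"time"
)

// Same backoff table and limit as shown above.
var failedRunnerBackoff = []time.Duration{
	0, 5 * time.Second, 10 * time.Second,
	20 * time.Second, 40 * time.Second, 80 * time.Second,
}

const maxFailures = 5

func main() {
	for n := 0; n <= maxFailures+1; n++ {
		if n > maxFailures {
			// 6th failure and beyond: no more requeueing, the runner is deleted instead.
			fmt.Printf("failures=%d: limit exceeded -> delete the EphemeralRunner\n", n)
			continue
		}
		fmt.Printf("failures=%d: requeue after %s\n", n, failedRunnerBackoff[n])
	}
}
```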
3. Immediate Deletion on the 6th Failure → EphemeralRunnerSet Recreates
```go
// Final phase
if len(er.Status.Failures) > maxFailures {
	log.Info("Maximum failures reached – deleting EphemeralRunner")
	if err := r.Delete(ctx, er); err != nil { ... }
	return ctrl.Result{}, nil
}
```
- Since the object itself is deleted, the controller's retries end there
- The parent resource EphemeralRunnerSet creates a new EphemeralRunner to meet the Desired count → Pod → the job resumes (see the sketch below)
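
As a rough illustration of why the deletion matters, here is a self-contained toy model of the "create enough runners to meet Desired" idea; the real EphemeralRunnerSet reconciler is of course more involved, so treat the function and its name as hypothetical:

```go
package main

import "fmt"

// runnersToCreate is a toy model of the "meet the Desired count" check:
// replacements are only created for runners that no longer exist.
func runnersToCreate(desired, existing int) int {
	if missing := desired - existing; missing > 0 {
		return missing
	}
	return 0
}

func main() {
	// Old behavior: the Phase=Failed runner still existed, so nothing was created.
	fmt.Println(runnersToCreate(1, 1)) // 0 -> "zombie" state

	// New behavior: the failed runner is deleted, so a replacement is created.
	fmt.Println(runnersToCreate(1, 0)) // 1 -> new EphemeralRunner, Pod, job resumes
}
```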
Operation Flow Summary
```mermaid
graph TD
    A[Pod Failed] --> B{"Failures <= 5?"}
    B -- Yes --> C["Record failure & requeue after backoff"]
    B -- No --> D[Delete EphemeralRunner CR]
    D --> E[EphemeralRunnerSet creates new EphemeralRunner]
```
6. Conclusion
PR #4059 solves the long-standing issue (Issue #2721) of "getting stuck after 5 interruptions" that's particularly frequent in Spot × minRunners environments.
- Manages failures chronologically and retries with exponential backoff
- Deletes the EphemeralRunner CR upon exceeding the limit, allowing EphemeralRunnerSet to recreate it
- This achieves self-healing close to a fully managed experience
- However, the risk of job execution being disrupted by Spot interruptions still remains, so caution is needed when using Spot instances