A pod crashes at 3am. Kubernetes restarts it. It crashes again. Kubernetes keeps trying.
Meanwhile you are asleep. Nobody reads the logs. Nobody fixes anything. The pod just keeps crashing until someone wakes up, opens a terminal, and figures out what went wrong.
I built a system that handles the reading and thinking part automatically. When a pod fails, it pulls the logs, asks Claude what went wrong, and opens a GitHub PR with a suggested fix. By the time you wake up, the analysis is already done.
## What it actually does
The system runs as a .NET background service alongside your cluster. It streams Kubernetes events in real time. When it sees a pod enter CrashLoopBackOff, OOMKilled, ImagePullError, or exit with a non-zero code, it kicks off a healing process.
Here is what happens in order:
- Pod failure detected
- Last 100 lines of logs pulled from the pod
- Logs and failure details sent to Claude API
- Claude returns root cause, severity, and a suggested fix
- A branch is created on GitHub
- The analysis is committed as a markdown file
- A PR is opened automatically
The whole thing takes about 10 seconds from crash to PR.
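The steps above can be sketched as a single orchestrator method. This is illustrative only — the method and service names here are my shorthand, not the repo's actual signatures:

```csharp
// Illustrative happy path: failure -> logs -> Claude -> PR.
public async Task HealAsync(FailureEvent failure)
{
    // Pull the last 100 log lines from the failed container
    var logs = await _kubernetes.GetPodLogsAsync(
        failure.PodName, failure.Namespace, tailLines: 100);

    // Ask Claude for root cause, severity, and a suggested fix
    var analysis = await _analyser.AnalyseAsync(failure, logs);

    // Branch, commit the markdown analysis, open the PR
    var branchName = $"self-heal/{failure.PodName}-{DateTime.UtcNow:yyyyMMddHHmmss}";
    var prUrl = await _github.OpenHealingPrAsync(branchName, analysis);

    _logger.LogInformation("Healing PR opened: {Url}", prUrl);
}
```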
## A real example
I deployed a pod that intentionally fails to connect to a database. Here is what Claude wrote in the PR:
"The application failed to establish a connection to the PostgreSQL database at postgres://db:5432. The pod crashed with exit code 1 after logging ERROR: Database connection failed and FATAL: Cannot connect to postgres://db:5432. This indicates either the database service is unreachable, not running, or the connection string is incorrect."
Then it generated the Kubernetes Service YAML to fix the service discovery and listed seven kubectl commands to diagnose the issue. In a PR. Automatically. While I was watching the logs.
## How it is built
There are four pieces.
KubernetesWatcher is a .NET BackgroundService that uses KubernetesClient to stream pod events. It looks for specific failure conditions in the container status:
```csharp
private FailureEvent? DetectFailure(V1Pod pod, V1ContainerStatus status)
{
    var waiting = status.State?.Waiting;
    var terminated = status.State?.Terminated;

    if (waiting?.Reason == "CrashLoopBackOff")
        return CreateFailure(pod, FailureType.CrashLoopBackOff, waiting.Reason, waiting.Message ?? "");

    if (waiting?.Reason == "ImagePullBackOff" || waiting?.Reason == "ErrImagePull")
        return CreateFailure(pod, FailureType.ImagePullError, waiting.Reason, waiting.Message ?? "");

    if (terminated?.Reason == "OOMKilled")
        return CreateFailure(pod, FailureType.OOMKilled, "OOMKilled", "Container exceeded memory limit");

    if (terminated?.ExitCode > 0 && status.RestartCount > 0)
        return CreateFailure(pod, FailureType.PodCrash, "CrashExit",
            $"Container exited with code {terminated.ExitCode}");

    return null;
}
```
ClaudeAnalyser takes the failure and logs, builds a prompt, and calls the Anthropic API. It asks for structured JSON back so the response is easy to parse:
```csharp
var response = await _client.Messages.GetClaudeMessageAsync(
    new MessageParameters
    {
        Model = "claude-haiku-4-5-20251001",
        MaxTokens = 2048,
        Messages = new List<Message>
        {
            new Message
            {
                Role = RoleType.User,
                Content = new List<ContentBase>
                {
                    new TextContent { Text = prompt }
                }
            }
        }
    });
```
The prompt tells Claude to act as a Kubernetes expert and return root cause, severity, a plain English fix, and the actual code or config change. Haiku is fast enough for this and costs around $0.0008 per analysis.
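On the way back, the structured reply can be deserialised straight into a record. This is a sketch assuming System.Text.Json and the JSON schema from the prompt shown further down; `responseText` stands in for the text of Claude's reply:

```csharp
using System.Text.Json;

// responseText: the raw text content of Claude's reply.
var analysis = JsonSerializer.Deserialize<PodAnalysis>(
    responseText,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });

// Mirrors the JSON structure the prompt asks Claude to return.
public record PodAnalysis(
    string RootCause,
    string Severity,
    string SuggestedFix,
    string CodeFix,
    string FixType);
```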
GitHubService uses Octokit to create a branch, commit the analysis as a markdown file, and open a PR:
```csharp
await _client.Repository.Content.CreateFile(
    _settings.GitHubRepoOwner,
    _settings.GitHubRepoName,
    fileName,
    new CreateFileRequest(
        message: $"fix: self-healing patch for {result.OriginalFailure.Type} in {result.PodName}",
        // Octokit base64-encodes the content itself by default, so pass
        // the raw string rather than pre-encoding it.
        content: fixContent,
        branch: branchName));
```
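The branch has to exist before that commit lands on it. With Octokit that means creating a ref from the default branch's HEAD first — a sketch, assuming the default branch is `main`:

```csharp
// Create the healing branch from main's current SHA.
// "main" is an assumption; the repo's default branch may differ.
var main = await _client.Git.Reference.Get(
    _settings.GitHubRepoOwner, _settings.GitHubRepoName, "heads/main");

await _client.Git.Reference.Create(
    _settings.GitHubRepoOwner, _settings.GitHubRepoName,
    new NewReference($"refs/heads/{branchName}", main.Object.Sha));
```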
HealingOrchestrator coordinates the three above and makes sure the same failure does not trigger duplicate healing attempts while the first one is still running.
## The prompt matters more than the code
I spent more time on the Claude prompt than on anything else in this project. Telling it to return structured JSON, referencing actual log lines, and specifying the failure type all produce much better output than a generic ask.
The prompt I settled on:
```text
You are a Kubernetes expert and .NET engineer analysing a production failure.
Analyse this Kubernetes pod failure and respond ONLY with valid JSON.

FAILURE DETAILS:
Pod: {pod name}
Namespace: {namespace}
Failure Type: {type}
Reason: {reason}

POD LOGS (last 100 lines):
{logs}

Respond with this exact JSON structure:
{
  "rootCause": "Clear explanation of what caused the failure",
  "severity": "Critical|High|Medium|Low",
  "suggestedFix": "Step by step fix in plain English",
  "codeFix": "The actual code or config change needed. Empty string if none.",
  "fixType": "config|code|resources|image|none"
}

Be specific. Reference actual log lines where relevant.
```
The key instruction is "respond ONLY with valid JSON". Without that, Claude adds explanation text around the JSON and the parser breaks.
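Even with that instruction, trimming the response to its outermost braces before parsing is cheap insurance. A sketch, not the repo's actual parser:

```csharp
// Keep only the outermost {...} in case prose leaks in around the JSON.
static string ExtractJson(string response)
{
    var start = response.IndexOf('{');
    var end = response.LastIndexOf('}');
    return (start >= 0 && end > start)
        ? response.Substring(start, end - start + 1)
        : response;
}
```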
## Running it locally
You need .NET 10, Docker Desktop with Kubernetes enabled, an Anthropic API key, and a GitHub personal access token with repo permissions.
```shell
git clone https://github.com/aftabkh4n/genai-devops-platform.git
cd genai-devops-platform/DeploymentService
```

Add your Anthropic and GitHub keys to appsettings.json, then `dotnet run`. You will see:

```text
[INF] Kubernetes watcher started. Watching namespace: default
[INF] Now listening on: http://localhost:0000
```
To test it, deploy the crash test included in the repo:
```shell
kubectl apply -f crash-test.yaml
```
Within a few seconds you will see the failure detected in the console, Claude called, and a PR URL printed. Go check your GitHub repo.
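If you would rather write your own failing pod than use the bundled one, something in this spirit reproduces the database-failure scenario. This is a sketch, not the repo's actual crash-test.yaml:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crash-test
spec:
  containers:
    - name: crash-test
      image: busybox:1.36
      # Log a fake connection failure, then exit non-zero so the pod
      # ends up in CrashLoopBackOff.
      command: ["sh", "-c"]
      args:
        - echo 'ERROR: Database connection failed';
          echo 'FATAL: Cannot connect to postgres://db:5432';
          exit 1
```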
## What I learned
The watcher reconnects automatically when the Kubernetes stream disconnects. This happens more often than you would think, especially after cluster restarts. Wrapping the watch loop in a try/catch with a 5 second delay before reconnecting keeps it stable.
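That reconnect logic boils down to a loop like this — illustrative, not the repo's exact code:

```csharp
// If the watch stream drops, log it, wait five seconds, and
// re-establish the watch instead of letting the service die.
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        try
        {
            await WatchPodsAsync(stoppingToken); // runs while the stream is healthy
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex, "Watch stream dropped; reconnecting in 5s");
            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }
}
```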
The SemaphoreSlim in HealingOrchestrator is important. A pod in CrashLoopBackOff generates events every few seconds. Without deduplication you end up with ten simultaneous Claude API calls for the same pod, ten branches, and ten PRs. Not ideal.
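One way to implement that deduplication — illustrative, assuming a per-pod key like `namespace/podName`:

```csharp
// One healing run per pod at a time: a semaphore guards an in-flight set.
private readonly SemaphoreSlim _gate = new(1, 1);
private readonly HashSet<string> _inFlight = new();

private async Task<bool> TryBeginHealingAsync(string podKey)
{
    await _gate.WaitAsync();
    try { return _inFlight.Add(podKey); } // false: a run is already active
    finally { _gate.Release(); }
}

private async Task EndHealingAsync(string podKey)
{
    await _gate.WaitAsync();
    try { _inFlight.Remove(podKey); }
    finally { _gate.Release(); }
}
```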
Using jq to build JSON payloads is safer than shell string interpolation. Pod names and log lines contain characters that break JSON when interpolated directly.
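For example, passing values with `--arg` lets jq do the escaping instead of the shell (the variable values here are made up):

```shell
# Interpolating $log_line straight into a JSON string literal would break
# on the embedded quotes; --arg hands the value to jq, which escapes it.
pod_name="crash-test-7d4b"
log_line='FATAL: Cannot connect to "postgres://db:5432"'

jq -n --arg pod "$pod_name" --arg log "$log_line" \
  '{pod: $pod, lastLog: $log}'
```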
Source code: https://github.com/aftabkh4n/genai-devops-platform
If you try it and find a failure type it misses, open an issue.