Kubernetes is powerful, but troubleshooting issues in a live cluster can be painful. In a complex deployment, critical warning signs often hide in thousands of log lines and events. What if we could surface these reliability issues before they take applications down?
Preq (pronounced "preek") is an open-source tool that brings a proactive approach to Kubernetes troubleshooting. It is a reliability problem detector that checks your cluster's logs, events, and configurations against a community-driven catalog of failure patterns [1]. Using Preq, you can monitor your cluster and catch misconfigurations, anti-patterns, or bugs early, instead of discovering them during a 2 AM incident [1].
Installing preq via Krew
preq is distributed as a kubectl plugin, making it easy to install through the Kubernetes Krew plugin manager. First, ensure you have Krew set up (if not, install it from the official docs). Then install Preq with a single command:
kubectl krew install preq
Within seconds, the plugin is ready to use [1]. There's no extra configuration needed: Preq ships with the latest Common Reliability Enumeration (CRE) rule packages baked in and auto-updates them, so you're always scanning for the newest issues.
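To confirm the plugin is actually available to kubectl, Krew can list what it has installed:

# preq should appear in the list of installed plugins
kubectl krew list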
Running kubectl preq from the CLI
Once installed, you can run Preq directly via kubectl to check various Kubernetes resources and their logs:
Pods: Scan an individual pod's logs and related events. For example, kubectl preq my-pod-abc123 will fetch that pod's logs and events, then compare them against the CRE rule library [1].
Services: Running kubectl preq service/my-service triggers Preq to assess the pods behind that Service. While Services themselves don't have logs, Preq will identify the endpoints/pods for the service and check their logs and events for known issues.
Jobs and CronJobs: Run Preq on a Job or on pods created by a CronJob to inspect execution logs and events[12].
Under the hood, the Preq plugin uses Kubernetes APIs. This means you can run Preq on any resource type that has associated logs or events, giving you a flexible "detective" for your cluster.
Using Preq with ConfigMaps and Events
The current release of Preq primarily targets logs and manifests, but you can also leverage it for configuration files and cluster events with a little creativity.
ConfigMaps
Directly scan a ConfigMap with the plugin:
kubectl preq -n <namespace> configmap/<name-of-config-map>
Kubernetes events
Use this feeder to stream a timestamp and the raw event into Preq:
kubectl get events -A -o json | jq -r '.items[] | "\(.metadata.creationTimestamp) \(tojson)"' | kubectl preq
Other workload configurations beyond ConfigMaps
Use this workaround to feed Deployments and similar manifests to Preq as compact JSON, prefixed with a UTC timestamp so the output reads like a log line:
kubectl get deploy -A -o json | jq -c . | sed -e "1s/^/$(date -u +"%Y-%m-%dT%H:%M:%SZ") /" | kubectl preq
Example CREs
We'll highlight a few Common Reliability Enumerations created by community members[2][3][4][5]:
| CRE | What breaks | Signals you will see |
| --- | --- | --- |
| CRE-2025-0119 | Too many pods down during an update | Rollout stalls, unavailable replicas, PDB budget exceeded |
| CRE-2025-0071 | Cluster DNS resolution fails when CoreDNS has no ready pods or endpoints | CoreDNS availableReplicas at zero, kube-dns endpoints empty, pods in CrashLoopBackOff, CoreDNS logs show errors |
| CRE-2025-0048 | Worker node enters NotReady because the control plane cannot resolve the node's FQDN | Node status shows NotReady without resource pressure, control plane logs may show hostname resolution errors |
| CRE-2025-0125 | Kubelet crashes under rapid pod launches, causing node NotReady and a full node-level outage with pod evictions | Node NotReady, mass pod evictions and rescheduling, kubelet logs show a panic in EventedPLEG evented.go |
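For instance, the signals behind CRE-2025-0071 are the sort of thing you could check by hand, which is exactly the legwork Preq automates. A rough sketch, assuming the default CoreDNS deployment name and kube-dns service/label in kube-system:

# how many CoreDNS replicas are actually available?
kubectl get deployment coredns -n kube-system -o jsonpath='{.status.availableReplicas}'

# does the kube-dns service have any ready endpoints?
kubectl get endpoints kube-dns -n kube-system

# are any CoreDNS pods crash-looping?
kubectl get pods -n kube-system -l k8s-app=kube-dns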
Exit Code CREs: Crashes and Their Causes (137, 127, 134, 139)
Now let's talk about a different kind of problem: containers that keep exiting with mysterious status codes. Preq includes CRE rules [7] for common exit codes to help pinpoint why a container crashed. Let's break down the usual suspects:
Exit Code 137
This typically means the process was killed with SIGKILL, which in Kubernetes usually indicates an out-of-memory (OOM) kill. In other words, the container was using more memory than allowed, so the OS OOM killer terminated it [6][7]. It can also happen if someone manually runs kill -9 on the process, but OOM is the usual cause. In Kubernetes you'll often see "Reason: OOMKilled" in the pod's status when this is the case.
Why it happens: Your app exceeded its memory limit or the node ran out of memory.
What to do: Check the container's memory limits and usage; kubectl top pod will show whether it was using a lot of memory. Increase the memory limit (or request) for the container to prevent the OOMKill, or optimize the application to use less memory. Preq can help by flagging frequent OOM kills so you know to take action before it impacts users.
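A quick manual check along those lines is to look at the last termination reason and the configured memory limit. The pod name below is a placeholder, and kubectl top requires metrics-server:

# was the last restart an OOM kill?
kubectl get pod my-pod-abc123 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# what memory limit is each container allowed?
kubectl get pod my-pod-abc123 -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'

# how much memory is the pod using right now?
kubectl top pod my-pod-abc123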
Exit Code 127
This means "command not found": the process tried to execute a file or command that doesn't exist in the container's filesystem [8][9]. It's a common error when the container's start command or entrypoint is misconfigured.
Why it happens: Either the binary isn't installed, the path is wrong, or a dependency is missing. It can also be a shell quoting issue, but usually it's a missing executable (a permission problem typically surfaces as exit code 126 instead).
What to do: Describe the pod (kubectl describe pod); Kubernetes will often surface a "command not found" message in the events. Fix the command in your container spec or Dockerfile, and make sure the image has the expected program at the correct location. Preq can catch this by scanning events or termination messages for "exited with code 127" and common error text. The solution is usually straightforward: install the missing tool or correct the command path.
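To see what the container was actually asked to run, you can compare the recorded exit code with the command in the spec. The pod name and container index below are placeholders for illustration:

# confirm the exit code recorded for the last termination
kubectl get pod my-pod-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# inspect the command and args Kubernetes passed to the container
kubectl get pod my-pod-abc123 -o jsonpath='{.spec.containers[0].command}{" "}{.spec.containers[0].args}'

# events often include the "command not found" message
kubectl describe pod my-pod-abc123 | grep -i -A2 "not found"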
Exit Code 134
This indicates the process received SIGABRT (the abort signal)[10]. In plainer terms, the application crashed itself, often due to an internal error like an assertion failure or a call to abort(), or it was terminated after a fatal error.
Why it happens: Common causes include application bugs (a failed assertion, or memory corruption that the runtime detects and aborts on), an allocation failure the application responds to by aborting, or hitting a resource limit that triggers an abort.
What to do: Check the container's logs for error messages or stack traces right before it exited; often you'll see a line about an assertion or fatal error. Preq will highlight the occurrence of a 134 exit and can point out whether it matches a known pattern. Make sure you're not hitting known bugs in the app version, and consider adding liveness probes if a container aborts frequently.
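Since the useful detail is usually in the output the process wrote just before aborting, pull the logs of the previous container instance rather than the restarted one. The pod and container names here are placeholders:

# logs from the previous (crashed) container instance
kubectl logs my-pod-abc123 --previous

# for multi-container pods, name the container explicitly
kubectl logs my-pod-abc123 -c my-container --previous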
Exit Code 139
This is the infamous segmentation fault (SIGSEGV)[10]. The process tried to access memory it shouldn't (invalid pointer, buffer overflow, etc.), and the OS killed it. This is almost always a bug in the application code (or a library it's using).
Why it happens: A segfault can be caused by many things: using a null pointer, reading/writing out of bounds, incompatible native libraries, etc. In some cases, even running out of stack can cause a segfault.
What to do: As with 134, the primary action is to check application logs or enable core dumps for debugging. If the segfault happens on startup, it could be an incompatibility, for example a wrong CPU architecture or missing dependencies. Ensure the image is built for the correct architecture. Preq's rule for 139 will alert you that a container hit SIGSEGV; it can't fix the code, but it ensures you notice the crash.
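If the crash is immediate on startup, an architecture mismatch is worth ruling out. A small sketch, with the pod name, registry, and image tag as placeholders:

# which node is the pod scheduled on, and what CPU architecture does it report?
NODE=$(kubectl get pod my-pod-abc123 -o jsonpath='{.spec.nodeName}')
kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.architecture}'

# compare with the architectures the image was built for
docker manifest inspect my-registry/my-image:tag | grep architecture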
Detecting exit codes with a unified command
You can quickly scan your cluster for pods that terminated with these exit codes by extracting each container's last termination state as raw text and piping it into Preq:
kubectl get pods --all-namespaces -o json | jq -r '
.items[] as $p
# gather regular, init, and ephemeral container statuses
| [ ($p.status.containerStatuses // []),
($p.status.initContainerStatuses // []),
($p.status.ephemeralContainerStatuses // []) ]
| add
| .[]
# use the last terminated state, falling back to the current one
| (.lastState.terminated // .state.terminated) as $t
| select($t != null and $t.exitCode != null and $t.finishedAt != null)
# emit: timestamp, namespace/pod, container name, reason, exit code
| [ $t.finishedAt,
($p.metadata.namespace + "/" + $p.metadata.name),
.name,
($t.reason // ""),
($t.exitCode|tostring) ]
| @TSV' | kubectl preq
Conclusion: Using Preq in Your Daily Workflow
In an ideal world, you catch problems before they cause downtime, and that's where Preq shines. Adopting Preq in your day-to-day Kubernetes workflow can significantly reduce mean time to detection:
CI/CD Integration: Consider running Preq as a post-deploy check in your continuous deployment pipeline. For example, after deploying a new version of an application, add a step that runs kubectl preq on that namespace or on the specific new pods.
Proactive scheduled runs: Use kubectl preq -j to generate a Kubernetes CronJob template; it writes cronjob.yaml. Open the file, set the schedule, add the Preq command you want to run (including any -a action and -o output), set the namespace, then apply it with kubectl apply -f cronjob.yaml
Note that kubectl preq does not support an all-namespaces flag. To scan many targets, pass a data source template to Preq or wrap multiple invocations in a small script that the CronJob runs (see the sketch after this list).
Post-mortem and Continuous Improvement: After any incident or outage, consider writing a new CRE rule (and contributing it!) if it was a novel issue. Preq's framework lets you codify that knowledge so that neither you nor anyone else gets bitten by the same problem twice.
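As a rough sketch of that scheduled-scan idea, a CronJob (or a CI post-deploy step) could loop over the pods you care about and invoke the plugin on each. The namespace list and the handling of non-zero exits below are illustrative assumptions, not documented Preq behavior:

#!/usr/bin/env bash
# scan-preq.sh: run kubectl preq against every pod in a few namespaces
set -euo pipefail

NAMESPACES=("payments" "checkout" "default")   # hypothetical namespaces

for ns in "${NAMESPACES[@]}"; do
  for pod in $(kubectl get pods -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
    echo "== preq: $ns/$pod =="
    # per-pod invocation, as shown earlier in the article
    kubectl preq -n "$ns" "$pod" || echo "preq reported findings or failed for $ns/$pod"
  done
done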
In summary, Preq is a powerful ally for Kubernetes users. It turns the wealth of community experience with failure modes into actionable insights you can run on-demand. By incorporating Preq into CI/CD pipelines, scheduled scans, and troubleshooting sessions, you can proactively detect and resolve issues – often before they turn into user-facing incidents. Happy monitoring, and may your clusters run clean and healthy!
If you're looking for enterprise features such as:
- a distributed detection engine that runs across many nodes and clusters
- a web UI with guided workflows for investigation and collaboration
- deeper integrations (for incident tracking, etc.)
- a control plane for managing the distributed engine
- a larger, proprietary set of CRE rules maintained by the Prequel Reliability Research Team (PRRT)
Check out Prequel, our commercial offering, and let us know what you think!
References
[1] Dev.to: 10 kubectl Plugins That Help Make You the Most Valuable Kubernetes Engineer in the Room
[2] CRE-2025-0119 | Prequel
[3] CRE-2025-0071 | Prequel
[4] CRE-2025-0048 | Prequel
[5] CRE-2025-0125 | Prequel
[6] Stack Overflow: Kubernetes Pods Terminated Exit Code 137
[7] Exit Code CREs | Prequel
[8] Installing Krew
[9] Schedule preq to run in a Cronjob