Kubernetes is powerful, but troubleshooting issues in a live cluster can be painful. In a complex deployment, critical warning signs often hide in thousands of log lines and events. What if we could surface these reliability issues before they take applications down?
Preq (pronounced "preek") is an open-source tool that brings a proactive approach to Kubernetes troubleshooting. It is a reliability problem detector that checks your cluster's logs, events, and configurations against a community-driven catalog of failure patterns [1]. Using Preq, you can monitor your cluster and catch misconfigurations, anti-patterns, or bugs early, instead of discovering them during a 2 AM incident [1].
Installing preq via Krew
preq is distributed as a kubectl plugin, making it easy to install through the Kubernetes Krew plugin manager. First, ensure you have Krew set up (if not, install it from the official docs). Then install Preq with a single command:
kubectl krew install preq
Within seconds, the plugin is ready to use [1]. There's no extra configuration needed: Preq ships with the latest Common Reliability Enumeration (CRE) rule packages baked in and auto-updates them, so you're always scanning for the newest issues.
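To confirm the plugin is actually available to kubectl, Krew can list what it has installed:

# preq should appear in the list of installed plugins
kubectl krew list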
Running kubectl preq from the CLI
Once installed, you can run Preq directly via kubectl to check various Kubernetes resources and their logs:
Pods: Scan an individual pod's logs and related events. For example, kubectl preq my-pod-abc123 will fetch that pod's logs and events, then compare them against the CRE rule library [1].
Services: Running kubectl preq service/my-service triggers Preq to assess the pods behind that Service. While Services themselves don't have logs, Preq will identify the endpoints/pods for the service and check their logs and events for known issues.
Jobs and CronJobs: Run Preq on a Job or on pods created by a CronJob to inspect execution logs and events[12].
Under the hood, the Preq plugin uses Kubernetes APIs. This means you can run Preq on any resource type that has associated logs or events, giving you a flexible "detective" for your cluster.
Using Preq with ConfigMaps and Events
The current release of Preq primarily targets logs and manifests, but you can also leverage it for configuration files and cluster events with a little creativity.
ConfigMaps
Directly scan a ConfigMap with the plugin:
kubectl preq -n <namespace> configmap/<name-of-config-map>
Kubernetes events
Use this feeder to stream a timestamp and the raw event into Preq:
kubectl get events -A -o json | jq -r '.items[] | "\(.metadata.creationTimestamp) \(tojson)"' | kubectl preq
Other workload configurations beyond ConfigMaps
Use this workaround to feed Deployments and similar manifests to Preq as compact JSON, prefixed with a UTC timestamp so the output reads like a log line:
kubectl get deploy -A -o json | jq -c . | sed -e "1s/^/$(date -u +"%Y-%m-%dT%H:%M:%SZ") /" | kubectl preq
Example CREs
We'll highlight a few Common Reliability Enumerations created by community members[2][3][4][5]:
| CRE | What breaks | Signals you will see |
| --- | --- | --- |
| CRE-2025-0119 | Too many pods down during an update | Rollout stalls, unavailable replicas, PDB budget exceeded |
| CRE-2025-0071 | Cluster DNS resolution fails when CoreDNS has no ready pods or endpoints | CoreDNS availableReplicas at zero, kube-dns endpoints empty, pods in CrashLoopBackOff, CoreDNS logs show errors |
| CRE-2025-0048 | Worker node enters NotReady because the control plane cannot resolve the node's FQDN | Node status shows NotReady without resource pressure, control plane logs may show hostname resolution errors |
| CRE-2025-0125 | Kubelet crashes under rapid pod launches, causing node NotReady and a full node-level outage with pod evictions | Node NotReady, mass pod evictions and rescheduling, kubelet logs show a panic in EventedPLEG evented.go |
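For instance, the signals behind CRE-2025-0071 are the sort of thing you could check by hand, which is exactly the legwork Preq automates. A rough sketch, assuming the default CoreDNS deployment name and kube-dns service/label in kube-system:

# how many CoreDNS replicas are actually available?
kubectl get deployment coredns -n kube-system -o jsonpath='{.status.availableReplicas}'

# does the kube-dns service have any ready endpoints?
kubectl get endpoints kube-dns -n kube-system

# are any CoreDNS pods crash-looping?
kubectl get pods -n kube-system -l k8s-app=kube-dns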
Exit Code CREs: Crashes and Their Causes (137, 127, 134, 139)
Now let's talk about a different kind of problem: containers that keep exiting with mysterious status codes. Preq includes CRE rules [7] for common exit codes to help pinpoint why a container crashed. Let's break down the usual suspects:
Exit Code 137
This typically means the process was killed with SIGKILL, which in Kubernetes usually indicates an out-of-memory (OOM) kill. In other words, the container was using more memory than allowed, so the OS OOM killer terminated it [6][7]. It can also happen if someone manually runs kill -9 on the process, but OOM is the usual cause. In Kubernetes you'll often see "Reason: OOMKilled" in the pod's status when this is the case.
Why it happens: Your app exceeded its memory limit or the node ran out of memory.
What to do: Check the container's memory limits and usage; kubectl top pod will show whether it was using a lot of memory. Increase the memory limit (or request) for the container to prevent the OOMKill, or optimize the application to use less memory. Preq can help by flagging frequent OOM kills so you know to take action before it impacts users.
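A quick manual check along those lines is to look at the last termination reason and the configured memory limit. The pod name below is a placeholder, and kubectl top requires metrics-server:

# was the last restart an OOM kill?
kubectl get pod my-pod-abc123 -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'

# what memory limit is each container allowed?
kubectl get pod my-pod-abc123 -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'

# how much memory is the pod using right now?
kubectl top pod my-pod-abc123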
Exit Code 127
This means "command not found": the process tried to execute a file or command that doesn't exist in the container's filesystem [8][9]. It's a common error when the container's start command or entrypoint is misconfigured.
Why it happens: Either the binary isn't installed, the path is wrong, or a dependency is missing. It can also be a shell quoting issue, but usually it's a missing executable (a permission problem typically surfaces as exit code 126 instead).
What to do: Describe the pod (kubectl describe pod); Kubernetes will often surface a "command not found" message in the events. Fix the command in your container spec or Dockerfile, and make sure the image has the expected program at the correct location. Preq can catch this by scanning events or termination messages for "exited with code 127" and common error text. The solution is usually straightforward: install the missing tool or correct the command path.
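To see what the container was actually asked to run, you can compare the recorded exit code with the command in the spec. The pod name and container index below are placeholders for illustration:

# confirm the exit code recorded for the last termination
kubectl get pod my-pod-abc123 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# inspect the command and args Kubernetes passed to the container
kubectl get pod my-pod-abc123 -o jsonpath='{.spec.containers[0].command}{" "}{.spec.containers[0].args}'

# events often include the "command not found" message
kubectl describe pod my-pod-abc123 | grep -i -A2 "not found"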
Exit Code 134
This indicates the process received SIGABRT (the abort signal)[10]. In plainer terms, the application crashed itself, often due to an internal error like an assertion failure or a call to abort(), or it was terminated after a fatal error.
Why it happens: Common causes include application bugs (a failed assertion, or memory corruption that the runtime detects and aborts on), an allocation failure the application responds to by aborting, or hitting a resource limit that triggers an abort.
What to do: Check the container's logs for error messages or stack traces right before it exited; often you'll see a line about an assertion or fatal error. Preq will highlight the occurrence of a 134 exit and can point out whether it matches a known pattern. Make sure you're not hitting known bugs in the app version, and consider adding liveness probes if a container aborts frequently.
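Since the useful detail is usually in the output the process wrote just before aborting, pull the logs of the previous container instance rather than the restarted one. The pod and container names here are placeholders:

# logs from the previous (crashed) container instance
kubectl logs my-pod-abc123 --previous

# for multi-container pods, name the container explicitly
kubectl logs my-pod-abc123 -c my-container --previous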
Exit Code 139
This is the infamous segmentation fault (SIGSEGV)[10]. The process tried to access memory it shouldn't (invalid pointer, buffer overflow, etc.), and the OS killed it. This is almost always a bug in the application code (or a library it's using).
Why it happens: A segfault can be caused by many things: using a null pointer, reading/writing out of bounds, incompatible native libraries, etc. In some cases, even running out of stack can cause a segfault.
What to do: As with 134, the primary action is to check application logs or enable core dumps for debugging. If the segfault happens on startup, it could be an incompatibility, for example a wrong CPU architecture or missing dependencies. Ensure the image is built for the correct architecture. Preq's rule for 139 will alert you that a container hit SIGSEGV; it can't fix the code, but it ensures you notice the crash.
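If the crash is immediate on startup, an architecture mismatch is worth ruling out. A small sketch, with the pod name, registry, and image tag as placeholders:

# which node is the pod scheduled on, and what CPU architecture does it report?
NODE=$(kubectl get pod my-pod-abc123 -o jsonpath='{.spec.nodeName}')
kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.architecture}'

# compare with the architectures the image was built for
docker manifest inspect my-registry/my-image:tag | grep architecture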
Detecting exit codes with a unified command
You can quickly scan your cluster for pods that terminated with these exit codes by extracting each container's last termination state as raw text and piping it into Preq:
kubectl get pods --all-namespaces -o json | jq -r '
.items[] as $p
# gather regular, init, and ephemeral container statuses
| [ ($p.status.containerStatuses // []),
($p.status.initContainerStatuses // []),
($p.status.ephemeralContainerStatuses // []) ]
| add
| .[]
# use the last terminated state, falling back to the current one
| (.lastState.terminated // .state.terminated) as $t
| select($t != null and $t.exitCode != null and $t.finishedAt != null)
# emit: timestamp, namespace/pod, container name, reason, exit code
| [ $t.finishedAt,
($p.metadata.namespace + "/" + $p.metadata.name),
.name,
($t.reason // ""),
($t.exitCode|tostring) ]
| @TSV' | kubectl preq
Conclusion: Using Preq in Your Daily Workflow
In an ideal world, you catch problems before they cause downtime, and that's where Preq shines. Adopting Preq in your day-to-day Kubernetes workflow can significantly reduce mean time to detection:
CI/CD Integration: Consider running Preq as a post-deploy check in your continuous deployment pipeline. For example, after deploying a new version of an application, add a step that runs kubectl preq on that namespace or on the specific new pods.
Proactive scheduled runs: Use kubectl preq -j to generate a Kubernetes CronJob template; it writes cronjob.yaml. Open the file, set the schedule, add the Preq command you want to run (including any -a action and -o output), set the namespace, then apply it with kubectl apply -f cronjob.yaml
Note that kubectl preq does not support an all-namespaces flag. To scan many targets, pass a data source template to Preq or wrap multiple invocations in a small script that the CronJob runs (see the sketch after this list).
Post-mortem and Continuous Improvement: After any incident or outage, consider writing a new CRE rule (and contributing it!) if it was a novel issue. Preq's framework lets you codify that knowledge so that neither you nor anyone else gets bitten by the same problem twice.
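As a rough sketch of that scheduled-scan idea, a CronJob (or a CI post-deploy step) could loop over the pods you care about and invoke the plugin on each. The namespace list and the handling of non-zero exits below are illustrative assumptions, not documented Preq behavior:

#!/usr/bin/env bash
# scan-preq.sh: run kubectl preq against every pod in a few namespaces
set -euo pipefail

NAMESPACES=("payments" "checkout" "default")   # hypothetical namespaces

for ns in "${NAMESPACES[@]}"; do
  for pod in $(kubectl get pods -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
    echo "== preq: $ns/$pod =="
    # per-pod invocation, as shown earlier in the article
    kubectl preq -n "$ns" "$pod" || echo "preq reported findings or failed for $ns/$pod"
  done
done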
In summary, Preq is a powerful ally for Kubernetes users. It turns the wealth of community experience with failure modes into actionable insights you can run on-demand. By incorporating Preq into CI/CD pipelines, scheduled scans, and troubleshooting sessions, you can proactively detect and resolve issues – often before they turn into user-facing incidents. Happy monitoring, and may your clusters run clean and healthy!
If you're looking for enterprise features such as:
- a distributed detection engine that runs across many nodes and clusters
- a web UI with guided workflows for investigation and collaboration
- deeper integrations (for incident tracking, etc.)
- a control plane for managing the distributed engine
- a larger, proprietary set of CRE rules maintained by the Prequel Reliability Research Team (PRRT)
Check out Prequel, our commercial offering, and let us know what you think!
References
[1] Dev.to: 10 kubectl Plugins That Help Make You the Most Valuable Kubernetes Engineer in the Room
[2] CRE-2025-0119 | Prequel
[3] CRE-2025-0071 | Prequel
[4] CRE-2025-0048 | Prequel
[5] CRE-2025-0125 | Prequel
[6] Stack Overflow: Kubernetes Pods Terminated Exit Code 137
[7] Exit Code CREs | Prequel
[8] Installing Krew
[9] Schedule preq to run in a Cronjob