<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lyndon Brown</title>
    <description>The latest articles on DEV Community by Lyndon Brown (@lyndon_brown_).</description>
    <link>https://dev.to/lyndon_brown_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3218371%2F48ae41fe-5ef7-41d9-8788-0eac8de678d3.jpg</url>
      <title>DEV Community: Lyndon Brown</title>
      <link>https://dev.to/lyndon_brown_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lyndon_brown_"/>
    <language>en</language>
    <item>
      <title>The Real State of Helm Chart Reliability (2025): Hidden Risks in 100+ Open‑Source Charts</title>
      <dc:creator>Lyndon Brown</dc:creator>
      <pubDate>Tue, 04 Nov 2025 19:55:35 +0000</pubDate>
      <link>https://dev.to/lyndon_brown_/the-real-state-of-helm-chart-reliability-2025-hidden-risks-in-100-open-source-charts-2cn1</link>
      <guid>https://dev.to/lyndon_brown_/the-real-state-of-helm-chart-reliability-2025-hidden-risks-in-100-open-source-charts-2cn1</guid>
      <description>&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.prequel.dev" rel="noopener noreferrer"&gt;Prequel&lt;/a&gt;&lt;/strong&gt;'s reliability research team audited 105 popular Kubernetes Helm charts to reveal missing reliability safeguards.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The average score was ~3.98/10
&lt;/li&gt;
&lt;li&gt;47.6% (50 charts) rated "High Risk" (score ≤3/10)
&lt;/li&gt;
&lt;li&gt;Only 17.1% (18 charts) were rated "Reliable" (≥7/10)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Key missing features include&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod Topology Spread Constraints (93% absent)&lt;/li&gt;
&lt;li&gt;PodDisruptionBudget (74% absent)&lt;/li&gt;
&lt;li&gt;Horizontal Pod Autoscalers (75% absent)&lt;/li&gt;
&lt;li&gt;CPU/Memory resource requests/limits (50–60% absent)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Several 0/10 charts were DaemonSets (e.g., Fluent Bit, node-exporter, GPU plugins) where PDB/TopologySpread/HPA/Replicas are generally not applicable.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;It’s important to note that a low score does not necessarily mean the software itself is bad; rather, it means the default deployment setup might not offer high reliability standards.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;We recommend end users patch missing controls via &lt;code&gt;values.yaml&lt;/code&gt; or via Helm overlays.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Users should use continuous reliability protection tools like &lt;strong&gt;Prequel&lt;/strong&gt; to identify missing safeguards and monitor for impact.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks3dnvgfyr494vk1x1wq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fks3dnvgfyr494vk1x1wq.png" alt="helm quote" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Reliability is one of the main reasons teams adopt Kubernetes: it promises self-healing workloads, automated rollouts, and consistent recovery across environments. However, it is easy to undercut these advantages. A Helm chart packages the Kubernetes manifests that define how an application is deployed. When those charts omit best practices or include misconfigurations, the resulting deployments can become unreliable.&lt;/p&gt;

&lt;p&gt;This report presents a comprehensive reliability audit of 105 popular Helm charts. The goal is to identify how well these charts adhere to known best practices that improve uptime, resiliency, and safe operations in Kubernetes environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What do we mean by "reliability" in this context?&lt;/strong&gt; Essentially, we looked for Kubernetes manifest settings that help applications &lt;strong&gt;survive disruptions, autoscale to handle load, and avoid common failure modes&lt;/strong&gt;. These settings correspond to widely recommended practices such as configuring PodDisruptionBudgets, spreading pods across zones and nodes, defining CPU/Memory requests, enabling Horizontal Pod Autoscaling with sensible minimums, etc. When these features are properly set, applications are better protected against outages (both planned and unplanned).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe51sewvnn5ekjfxzvos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbe51sewvnn5ekjfxzvos.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How did we collect the data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Prequel Reliability Research Team (PRRT) evaluated a total of &lt;strong&gt;105 Helm charts&lt;/strong&gt;, selected to cover a broad range of popular open-source applications and infrastructure components. These include charts for observability tools (e.g. Grafana, Prometheus), databases and storage systems (e.g. MySQL, Elasticsearch), networking and security add-ons (e.g. NGINX Ingress, Calico), messaging systems (e.g. Kafka, Pulsar), machine learning tools, and more. Each chart was analyzed using both its default manifests and a minimal "HA‑capability" render to reveal what's supported when scaled.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What were our criteria?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;We checked each chart against &lt;strong&gt;10 key reliability criteria&lt;/strong&gt;. Each criterion corresponds to a best-practice configuration that improves reliability. Each chart was rendered twice: once with default values (out-of-the-box) and once with a minimal "HA-capability" override (e.g., setting replicaCount/replicas ≥ 2 and enabling autoscaling/HPA when available). Capability-style criteria (PDB, TopologySpreadConstraints, HPA) are considered "present" if they appear in either render.&lt;/p&gt;
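&lt;p&gt;As a sketch of that two-pass methodology (the chart reference and value keys below are hypothetical; real key names vary per chart):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Pass 1: render with default values (out-of-the-box)
helm template my-release example-repo/example-chart > default.yaml

# Pass 2: render with a minimal "HA-capability" override
helm template my-release example-repo/example-chart \
  --set replicaCount=2 \
  --set autoscaling.enabled=true > ha.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Criteria like PDB support then count as present if the relevant object shows up in either rendered output.&lt;/p&gt;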

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;em&gt;DaemonSet (DS)-based charts do not use PDB, TopologySpreadConstraints, HPA, or Replicas in the same way as Deployments/StatefulSets. For DS we emphasize CPU/Memory Requests/Limits and Liveness. Interpret DS scores with this in mind.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The specific criteria audited were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;PodDisruptionBudget (PDB)&lt;/strong&gt; - Does the chart define a PodDisruptionBudget for its pods? A PDB ensures that a &lt;em&gt;minimum number of pods stay up during voluntary disruptions&lt;/em&gt; (like node drains or upgrades), so that maintenance events don't accidentally take down the entire application[2].&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a PDB &lt;em&gt;limits&lt;/em&gt; how many pods can be offline at once, preserving availability[3].&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Topology Spread Constraints&lt;/strong&gt; (N/A for DaemonSets; one pod runs per node, so spread is implicit) - Does the chart use topologySpreadConstraints to spread replicas across nodes/zones? This feature prevents all pods from landing on the same node or zone. By distributing pods across failure domains, it &lt;strong&gt;reduces the blast radius&lt;/strong&gt; - if one node or AZ goes down, it won't take out every replica[4].&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Topology spread constraints thus improve resiliency in multi-node or multi-zone clusters.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Horizontal Pod Autoscaler (HPA)&lt;/strong&gt; (N/A for DaemonSets; HPA does not scale DS) - Does the chart include a HorizontalPodAutoscaler resource (or support enabling it via values)? An HPA will automatically adjust the number of pod replicas based on workload (CPU, memory, or custom metrics)[5].&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This ensures the application can scale out to handle surges in demand and scale back down to save resources, thereby preventing overload and maintaining performance during peak loads. In practice, an HPA with minReplicas ≥ 2 will ensure redundancy.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;CPU Requests&lt;/strong&gt; - Do the pods have CPU requests set? A CPU request reserves a certain amount of CPU for the container. Setting requests is vital because it lets the Kubernetes scheduler know the pod's needs, and prevents scheduling too many high-demand pods on one node.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without CPU requests, pods might be squeezed onto a node without guaranteed compute, leading to unpredictability. We treat absence of CPU requests as a reliability risk.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;CPU Limits&lt;/strong&gt; - Do the pods have CPU limits defined? CPU limits cap how much CPU time a container can use. This is important to prevent a single pod from monopolizing the CPU on a node.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Left unchecked, a misbehaving pod can starve co‑located workloads (including system components), hurting overall cluster stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Note&lt;/strong&gt;: &lt;em&gt;The use of CPU limits is debated; limits can introduce CPU throttling under load (sometimes even when usage appears within the configured limit), which may cause latency spikes. We still treat the presence of reasonable limits as a reliability factor because they improve predictability and blast‑radius control.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Requests&lt;/strong&gt; - Do the pods have memory requests set? Like CPU requests, memory requests ensure the scheduler gives the pod a guaranteed amount of RAM. This helps avoid scenarios where too many memory-hungry pods are placed on one node, which could lead to OOM (Out-Of-Memory) kills or node instability if memory is overcommitted. Charts should specify memory requests for reliability.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Memory Limits&lt;/strong&gt; - Do the pods have memory limits? A memory limit puts an upper bound on memory usage for a container. This prevents a runaway process from consuming all available memory on the node.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Without a memory limit, a single container can trigger out-of-memory conditions that crash itself or even the node. Setting memory limits thus contains faults and improves resilience.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Liveness Probe&lt;/strong&gt; - Does the container include a liveness probe (heartbeat check)? Liveness probes allow Kubernetes to detect if an application has hung or crashed internally. If the liveness probe fails, Kubernetes will automatically restart the container[7][8].&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This self-healing mechanism is crucial for reliability, ensuring that issues like deadlocks or crashes don't go unnoticed.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Readiness Probe&lt;/strong&gt; (usually N/A for DaemonSets; often no Service) - Does the container include a readiness probe? A readiness probe signals when a pod is ready to serve traffic. Kubernetes will not send traffic to a pod (for example, attach it to a Service load balancer) until its readiness probe succeeds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This prevents sending requests to pods that are still initializing or are unhealthy[8]. In our context, having readiness probes means smoother rollouts and no premature traffic to unready pods, avoiding potential errors during startup or local disruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;readiness can also fail after startup to indicate momentary unavailability; the pod will not be restarted as long as the liveness probe continues to succeed.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;PriorityClass&lt;/strong&gt; - Does the chart assign a PriorityClass to its pods? Priority classes determine the priority of pods for scheduling and eviction. Using a PriorityClass for critical workloads ensures that in resource crunch scenarios, &lt;strong&gt;lower-priority pods won't displace or interfere with more critical pods&lt;/strong&gt;[9].&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Essentially, it helps protect mission-critical applications from being preempted or starved by less important ones[9]. While not every application needs a custom priority, setting one for system-critical services can improve reliability during cluster stress.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Note&lt;/strong&gt;: &lt;em&gt;we present PriorityClass as informational and do not score it by default, but for DaemonSet-based charts it can be considered a reliability factor.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
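&lt;p&gt;Taken together, the scored safeguards look roughly like this on a Deployment plus a companion PodDisruptionBudget (a minimal sketch with illustrative names and values, not a drop-in manifest):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # illustrative workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:      # spread replicas across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx:1.27           # illustrative image
          resources:
            requests:                 # reserve CPU/memory for scheduling
              cpu: 250m
              memory: 256Mi
            limits:                   # cap usage to contain noisy neighbors
              cpu: "1"
              memory: 512Mi
          livenessProbe:              # restart on hangs/deadlocks
            httpGet:
              path: /healthz
              port: 80
          readinessProbe:             # gate traffic until ready
            httpGet:
              path: /ready
              port: 80
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1                     # keep at least one pod up during drains
  selector:
    matchLabels:
      app: web
&lt;/code&gt;&lt;/pre&gt;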

&lt;h3&gt;
  
  
  How did we score charts?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk59shfeoh4yw50s9xfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk59shfeoh4yw50s9xfh.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each Helm chart was evaluated against the 9 scored criteria (all of the above except PriorityClass). For each scored criterion present, the chart earned 1 point. Thus, charts could score between 0 (none of the best practices present) and 9 (all scored practices present). We then categorized charts into reliability tiers based on their score:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;"Reliable"&lt;/strong&gt; - Score of 7 or above (i.e., implementing at least ~75% of the scored practices). These charts have most of the important safeguards in place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;"Moderate"&lt;/strong&gt; - Score of 4 to 6. These charts follow some best practices but lack others, indicating room for improvement.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;"High Risk"&lt;/strong&gt; - Score of 3 or below. Such charts miss the majority of reliability features, likely making them fragile in real-world conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's important to note that a low score does not necessarily mean the software itself is bad; rather, it means the &lt;strong&gt;default Kubernetes manifests provided by the Helm chart might not ensure high availability or resilience&lt;/strong&gt;. Users could still deploy those applications reliably by tweaking configurations (e.g., enabling HPA or increasing replicas), but out-of-the-box, the chart might expose them to more risk.&lt;/p&gt;
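&lt;p&gt;For example, a user-side override file might look like this (a hedged sketch; the exact key names differ from chart to chart, so check each chart's &lt;code&gt;values.yaml&lt;/code&gt; schema before copying anything):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# values-reliability.yaml (illustrative; key names vary per chart)
replicaCount: 2
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 5
podDisruptionBudget:
  enabled: true
  minAvailable: 1
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Applied with &lt;code&gt;helm install -f values-reliability.yaml&lt;/code&gt;, such an override can close several of the gaps above without forking the chart.&lt;/p&gt;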

&lt;p&gt;For this report, we opted not to add weights to the various safeguards.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5hh30kcnpffyuk6ny7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5hh30kcnpffyuk6ny7j.png" alt=" " width="800" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Overall Reliability Scores&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Across the 105 Helm charts evaluated, the distribution of reliability scores was skewed toward the lower end. The &lt;strong&gt;mean score was ~3.98 out of 10&lt;/strong&gt;, and the &lt;strong&gt;median score was 4&lt;/strong&gt;, indicating that a typical chart implements only around four of the recommended reliability measures.&lt;/p&gt;

&lt;p&gt;This overall low average suggests that many popular Helm charts do not incorporate a comprehensive set of reliability features by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmgtjreqf3m5mmc9nxx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmgtjreqf3m5mmc9nxx9.png" alt=" " width="800" height="823"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To put these results in perspective: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only &lt;strong&gt;18 charts (17.1%)&lt;/strong&gt; scored in the &lt;strong&gt;"Reliable" tier&lt;/strong&gt;, meeting 7 or more of the criteria. In fact, the highest score observed was 9/10 (no chart had a perfect 10). This means very few charts have almost all the best practices in place. The top scorers tend to be well-maintained projects that explicitly focus on robust deployments (examples are discussed later). &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;37 charts (35.2%)&lt;/strong&gt; fell into the &lt;strong&gt;"Moderate" tier&lt;/strong&gt; with scores 4-6. These charts include some reliability features but are missing many others. They might, for instance, have health probes configured but lack autoscaling and disruption budgets, or vice versa. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50 charts (47.6%)&lt;/strong&gt; landed in the &lt;strong&gt;"High Risk" tier&lt;/strong&gt; with a score of 3 or below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alarmingly, nearly half of the charts audited implement only a few (if any) of the reliability best practices. In fact, 10 charts scored &lt;strong&gt;0/10&lt;/strong&gt;, meaning they &lt;em&gt;did not&lt;/em&gt; include a single one of the checked reliability features in their default manifests.&lt;/p&gt;

&lt;p&gt;On the other hand, the relatively small fraction of charts in the "Reliable" tier demonstrates that &lt;strong&gt;it is feasible&lt;/strong&gt; for a Helm chart to ship with strong reliability guardrails, so this is an attainable goal for chart maintainers. The findings suggest that there's significant room for improvement across the board, and users should not assume a chart is production-ready just because it's popular.&lt;/p&gt;

&lt;p&gt;In summary, the overall reliability state of Helm charts is &lt;strong&gt;middling to poor&lt;/strong&gt;, with a heavy tail of charts lacking critical features. Next, we delve into which specific reliability practices are most often absent, and which are more commonly implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.prequel.dev/sign-up" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u1wvkwir0oi5t00tkum.png" alt="CTA" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What's Missing Most Often&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fray097pf2f7nw1rztlsp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fray097pf2f7nw1rztlsp.png" alt=" " width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the pass/fail rates for each of the 10 criteria gives insight into which reliability practices chart maintainers commonly omit. The following are the key criteria, ordered from &lt;strong&gt;most neglected&lt;/strong&gt; to most adopted, along with the percentage of charts that &lt;strong&gt;failed&lt;/strong&gt; each check in our audit (&lt;strong&gt;note&lt;/strong&gt;: for capability-style criteria like PDB/TopologySpread/HPA, we count them as present if found either in the default render or under a minimal HA-capability render):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Topology Spread Constraints - 93%&lt;/strong&gt; of charts &lt;em&gt;do not&lt;/em&gt; specify any topology spread constraints. This was the most glaring gap: only about 7% of charts had this configuration. Essentially, almost all charts do nothing to ensure pods are spread across multiple nodes or zones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  As a result, if you deploy these charts, there's a high chance all replicas could land on a single node by default, making the application vulnerable to node-level failures.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;We didn't score podAntiAffinity. While it can reduce co-location, it often over‑constrains scheduling and can slow recovery during drains/outages. We focus on topologySpreadConstraints (more expressive and preferred) and may surface podAntiAffinity as informational in future iterations.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Horizontal Pod Autoscaler (HPA) - 75%&lt;/strong&gt; of charts lack an HPA. Three out of four charts do not provide an automated scaling policy. This means by default those applications will run a fixed number of replicas regardless of load.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If there's a traffic surge or high workload, the application won't scale out to handle it, potentially leading to performance degradation or downtime. In our evaluation, an HPA with minReplicas ≥ 2 is considered to provide redundancy; the low inclusion rate suggests many charts expect users to enable scaling themselves.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;PodDisruptionBudget (PDB) - 74%&lt;/strong&gt; of charts do not define a PDB. This is another high-impact gap: roughly only one in four charts includes a PodDisruptionBudget. Without a PDB, there's no built-in protection against voluntary disruptions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The fact that most charts omit this means they are prone to downtime during routine operations like node rotation or cluster upgrades, unless the user manually adds a PDB.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;CPU Limits - 63%&lt;/strong&gt; of charts have no CPU limits on containers. Nearly two-thirds of charts do not cap CPU usage. While Kubernetes can still function without limits, the risk is that a container under heavy load could consume all CPU cycles on a node.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  This can cause &lt;em&gt;noisy neighbor&lt;/em&gt; issues and even lead to other critical pods getting starved. The audit shows many charts leave this unchecked.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Memory Limits - 60%&lt;/strong&gt; of charts lack memory limits. Similar to CPU limits, a significant portion don't set any max memory usage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Without memory limits, a memory leak or spike in one container can trigger an OutOfMemory condition on the node, potentially killing not just that container but others on the node as well. Memory limits contain the impact of such issues to the offending pod. The lack of limits in 60% of charts suggests a prevalent oversight, possibly because setting a one-size-fits-all memory limit is tricky and maintainers opt not to set any - but at the cost of reliability.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;CPU Requests - 51%&lt;/strong&gt; of charts do not declare CPU requests. About half of the charts don't reserve CPU for their pods.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  This means the scheduler doesn't account for their CPU needs explicitly, which can lead to packing too many CPU-intensive pods on a node. Not having CPU requests can also degrade the effectiveness of autoscaling (HPA) because the HPA's decisions often rely on knowing the CPU utilization relative to requests. The fact that ~49% do set CPU requests is a mildly positive sign.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Memory Requests - 49%&lt;/strong&gt; of charts have no memory requests. This is in roughly the same range as CPU requests (just a hair better). About half the charts don't reserve memory.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Without memory requests, the scheduler might place too many memory-hungry pods together. However, the other ~51% do set memory requests, which indicates that at least for half the charts, basic resource reservations are considered.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Liveness Probes - 20%&lt;/strong&gt; of charts lack a liveness probe. Here we see a much better adoption: 80% include liveness probes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  This is encouraging: it suggests that the majority of chart maintainers recognize the importance of self-healing for their applications. The 20% missing probes might be either very simple apps that don't need it (though almost every app benefits from a liveness check) or just oversights.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Readiness Probes - 15%&lt;/strong&gt; of charts lack a readiness probe. This was the most well-adopted criterion: about 85% of charts have readiness probes configured.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Many chart authors seem to prioritize this, as it directly affects user experience during deployments/updates.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;a portion of the remaining ~15% may not serve traffic directly (e.g., agents/Daemons without a Service, batch/cron jobs), so a readiness check may not be necessary for those workloads.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;PriorityClass - 85%&lt;/strong&gt; do not specify any PriorityClass (thus pods run at default priority). This criterion is a bit different from others because not every app truly needs a custom priority; it's more relevant for multi-tenant clusters or ensuring system-critical pods have higher priority.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The low adoption isn't as alarming as the others - it likely reflects that most charts use the default priority (which is fine for many cases). The 15% that do set a PriorityClass are usually charts for important infrastructure components (like ingress controllers, logging agents, etc.) where maintainers deemed it necessary to ensure those pods are less likely to be evicted or preempted.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
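&lt;p&gt;Where a chart omits these capabilities entirely, the missing objects can be applied alongside the release. For instance, a standalone HPA might look like this (a sketch; the target Deployment name &lt;code&gt;web&lt;/code&gt; is hypothetical and must match the chart's actual workload name):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical release Deployment
  minReplicas: 2             # a minimum of 2 provides redundancy
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
&lt;/code&gt;&lt;/pre&gt;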

&lt;p&gt;In summary, the &lt;strong&gt;most commonly missing features&lt;/strong&gt; were &lt;strong&gt;topology spread constraints, PodDisruptionBudgets, and autoscaling&lt;/strong&gt;, each absent in well over 70% of charts. On the flip side, &lt;strong&gt;readiness and liveness probes&lt;/strong&gt; were well-adopted by ~4 out of 5 charts, indicating that basic health monitoring is largely in place. Resource requests/limits showed a mixed picture - about half the charts enforce them, half don't.&lt;/p&gt;

&lt;p&gt;It's worth noting that some charts might intentionally omit certain measures/settings expecting the user to configure them (for example, an autoscaler might be left out if the application's scaling requirements vary widely between deployments). However, given Helm charts often aim to provide a reasonable default setup, it's generally better to include these reliability features disabled or set to sensible defaults (which users can override) than to leave them out entirely.&lt;/p&gt;

&lt;p&gt;These findings highlight areas where chart maintainers could improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implementing PodDisruptionBudgets&lt;/strong&gt; would greatly enhance resilience during cluster maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adding Topology Spread Constraints&lt;/strong&gt; (even a simple zone spread) would add high availability for multi-node deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Including an HPA&lt;/strong&gt; (even off by default but available) would encourage autoscaling usage; setting minReplicas ≥ 2 under HPA provides redundancy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Setting resource requests/limits&lt;/strong&gt; (perhaps conservative defaults) would promote more consistent performance and avoid resource contention issues[6].&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Exposing these controls as configurable values (with sensible defaults) raises the reliability baseline of the Helm ecosystem with minimal friction.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
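&lt;p&gt;On the maintainer side, such a control can be exposed behind a values flag with a small template guard (a sketch; the helper names like &lt;code&gt;mychart.fullname&lt;/code&gt; are illustrative conventions, not part of any specific chart):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{{- if .Values.podDisruptionBudget.enabled }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "mychart.fullname" . }}
spec:
  minAvailable: {{ .Values.podDisruptionBudget.minAvailable | default 1 }}
  selector:
    matchLabels:
      {{- include "mychart.selectorLabels" . | nindent 6 }}
{{- end }}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Shipping the feature disabled by default keeps current behavior intact while letting users opt in with a single value.&lt;/p&gt;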

&lt;h2&gt;
  
  
  &lt;strong&gt;Reliability by Application Category&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8drto51b17r6wljvo0j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8drto51b17r6wljvo0j9.png" alt=" " width="800" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We grouped the charts into broad categories (based on the primary domain or function of the application) to see if certain types of applications tend to be more reliably configured than others. The categories included &lt;strong&gt;Monitoring/Logging&lt;/strong&gt;, &lt;strong&gt;Security&lt;/strong&gt;, &lt;strong&gt;Networking&lt;/strong&gt;, &lt;strong&gt;Database&lt;/strong&gt;, &lt;strong&gt;Storage&lt;/strong&gt;, &lt;strong&gt;Streaming/Messaging&lt;/strong&gt;, &lt;strong&gt;Integration/Delivery (CI/CD)&lt;/strong&gt;, &lt;strong&gt;AI/Machine Learning&lt;/strong&gt;, and a few &lt;strong&gt;Uncategorized&lt;/strong&gt; (for charts that didn't clearly fit a single domain).&lt;/p&gt;

&lt;p&gt;There were some noticeable differences in average scores across these groups:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming &amp;amp; Messaging&lt;/strong&gt; - Charts in this category (e.g. Apache Kafka, Pulsar, RabbitMQ) had the &lt;em&gt;highest average reliability score&lt;/em&gt;, around &lt;strong&gt;5.3/10&lt;/strong&gt;. This was the only category averaging above 5. A likely reason is that streaming systems are often stateful and critical, so their charts (especially Kafka's and Pulsar's) tend to incorporate features like PDBs and resource settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Databases&lt;/strong&gt; - Database charts (for systems like MySQL, PostgreSQL, MongoDB, etc.) also scored relatively well, averaging about &lt;strong&gt;4.4/10&lt;/strong&gt;. This was among the higher averages. Databases are stateful and often require careful handling of downtime, so we saw that many database charts include things like PDBs and requests/limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integration/CI-CD&lt;/strong&gt; - This category (which includes tools like Argo CD, Jenkins, GitLab Runner, etc.) had a moderate-to-high average of around &lt;strong&gt;4.2/10&lt;/strong&gt;. It's a small sample, but notably &lt;strong&gt;GitLab&lt;/strong&gt; and &lt;strong&gt;Harbor&lt;/strong&gt; (an artifact registry we placed under integration/delivery) both scored a solid 7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI / Machine Learning&lt;/strong&gt; - This is an interesting category. We grouped various machine learning tool charts (like Kubeflow, MLFlow, etc.) here. The average was roughly &lt;strong&gt;4.0/10&lt;/strong&gt;, about on par with the global average. We had a mix: &lt;strong&gt;Kubeflow's official chart&lt;/strong&gt; scored 8 (very good), but some others like smaller ML tools scored low.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Networking&lt;/strong&gt; - Charts providing networking infrastructure (e.g. ingress controllers like NGINX, CNI plugins like Calico, service meshes, etc.) averaged around &lt;strong&gt;3.9/10&lt;/strong&gt;, just below the overall average. Many networking-related charts turned out to be missing a number of best practices. This is somewhat surprising, since networking components are critical; the low scores may be because some networking daemons run as DaemonSets or have non-standard setups that our criteria didn't fully apply to (or were simply not configured with those features).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt; - Charts for security tools (like Falco, cert-manager, external-secrets, etc.) averaged roughly &lt;strong&gt;3.7/10&lt;/strong&gt;, on the lower side. Notably, &lt;strong&gt;Falco's chart scored 0&lt;/strong&gt; (it lacked all the features), which is a big red flag since Falco itself is a tool for security monitoring. This suggests reliability configuration hasn't been a focus in some security tool charts; perhaps maintainers assume a skilled operator will deploy and tune them, or it is simply an oversight.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt; - Storage system charts (e.g. Longhorn, OpenEBS, MinIO, etc.) were also below average, at around &lt;strong&gt;3.6/10&lt;/strong&gt;. It's somewhat concerning because storage systems are stateful and critical; one might hope their charts are highly robust.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring &amp;amp; Logging&lt;/strong&gt; - This was the &lt;strong&gt;lowest-scoring category&lt;/strong&gt;, averaging about &lt;strong&gt;3.36/10&lt;/strong&gt;. It also had the largest number of charts (since there are many monitoring/logging tools). A significant number of charts here had poor scores. &lt;strong&gt;Grafana Loki&lt;/strong&gt;, however, was an outlier with a 9 (which helped a bit). It's possible maintainers assume these tools run with a certain redundancy externally, or they simply haven't prioritized the reliability of the monitoring system itself. The irony is that tools used to monitor reliability of other apps were themselves often not configured reliably by default.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uncategorized&lt;/strong&gt; - We had a small set of charts we labeled uncategorized (miscellaneous). Their average was around &lt;strong&gt;5.1/10&lt;/strong&gt;, interestingly high. This bucket included things like some operator frameworks or bundles that didn't fit elsewhere.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, these category-based observations indicate that &lt;strong&gt;stateful services (databases, streaming)&lt;/strong&gt; tend to have better reliability setups than many &lt;strong&gt;operational or add-on tools (monitoring, security)&lt;/strong&gt;. One reason could be that stateful applications &lt;em&gt;demand&lt;/em&gt; careful handling (you can't just casually restart a database without thinking of data consistency, etc.), so chart authors had to incorporate protections like PDBs. Meanwhile, things like metric collectors or log shippers, while also important, might be seen as easier to redeploy and thus chart authors were less strict about adding budgets or spreads.&lt;/p&gt;

&lt;p&gt;These differences highlight that if you are deploying certain types of applications, you should be especially vigilant. For example, if you deploy a &lt;strong&gt;monitoring stack&lt;/strong&gt;, double-check its chart for missing reliability configs (our data suggests it's likely missing a few).&lt;/p&gt;

&lt;p&gt;In short, while no category was perfect, &lt;strong&gt;some domains clearly lag in reliability configuration&lt;/strong&gt; (monitoring/logging and some infra tools), and users should plan to fortify those charts themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Top Performing Charts (Examples of Good Practice)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvx57t9nejtkgy69scd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuvx57t9nejtkgy69scd4.png" alt=" " width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Despite the generally low scores overall, we identified a set of charts that serve as &lt;strong&gt;positive examples&lt;/strong&gt; of how to package an application for reliability. These top performers managed to include most of the recommended best practices. Here are a few notable ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Grafana Loki&lt;/strong&gt; - &lt;strong&gt;Score: 9/10.&lt;/strong&gt; This was the highest scoring chart in our audit. Loki (a log aggregation system) had all but one of the criteria present. It defined resource requests/limits, had both probes, included an HPA, set a PodDisruptionBudget, and even topology spread constraints.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt; - &lt;strong&gt;Score: 8/10.&lt;/strong&gt; Kafka is a critical streaming platform and its Helm chart scored very well. Included PDBs for brokers, resource requests/limits, and liveness/readiness probes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keycloak&lt;/strong&gt; - &lt;strong&gt;Score: 8/10.&lt;/strong&gt; Keycloak (an identity management service) also was configured with many best practices. It had health probes, resource management, PDB, etc.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pulsar&lt;/strong&gt; - &lt;strong&gt;Score: 8/10.&lt;/strong&gt; Apache Pulsar, another streaming platform, did excellently as well. Similar to Kafka, its chart includes comprehensive settings. The Pulsar chart actually consists of multiple components (broker, zookeeper, bookkeeper), each handled carefully with appropriate configs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubeflow&lt;/strong&gt; - &lt;strong&gt;Score: 8/10.&lt;/strong&gt; Kubeflow (the machine learning toolkit) had an official chart that scored high. This is interesting because Kubeflow is a very complex system. Multiple services configured with PDBs, resource requests/limits, and liveness/readiness probes; HPA is supported on components that can scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentry&lt;/strong&gt; - &lt;strong&gt;Score: 8/10.&lt;/strong&gt; Sentry (an error tracking platform) was also among the top. Since Sentry is an operational tool that teams rely on, it's good that its Helm chart tries to keep it highly available (for example, ensuring the web and worker pods have proper probes and budgets).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitLab&lt;/strong&gt; - &lt;strong&gt;Score: 7/10.&lt;/strong&gt; GitLab's chart (particularly the omnibus or the cloud-native GitLab chart) scored in the reliable tier as well. Given the number of sub‑components, this reflects broad coverage of probes, resource controls, and PDBs, with two criteria not satisfied in our scoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenEBS&lt;/strong&gt; - &lt;strong&gt;Score: 7/10.&lt;/strong&gt; OpenEBS (a storage orchestrator) was a bright spot in the storage category, scoring 7. It included PDBs (important for data pods) and resource controls. It stands in contrast to Longhorn's chart (which scored 0), showing that not all storage projects neglect reliability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Harbor&lt;/strong&gt; - &lt;strong&gt;Score: 7/10.&lt;/strong&gt; Harbor (container registry) is another complex application that scored well. PDBs and resource settings were present across core components (database, core, job service, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These top charts illustrate that &lt;strong&gt;high reliability scores are achievable&lt;/strong&gt;. They typically come from either: well-known companies/communities that enforce good devops practices in their charts (e.g., Grafana on Loki), or inherently critical software whose maintainers know the users will demand a resilient setup (databases, security/auth services, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Poorest Scoring Charts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmohqlwyjex379mljszq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqmohqlwyjex379mljszq.png" alt=" " width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the other end of the spectrum, we saw quite a few charts with minimal or no reliability features. It's important to highlight some of these &lt;strong&gt;not to single them out for blame, but to illustrate common patterns of omission&lt;/strong&gt; and to caution users of these charts to take extra care. Here are a few of the lowest-scoring examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Falco&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; Falco is a popular security monitoring tool (runtime security). Shockingly, its Helm chart did not include any of the checked reliability configurations. Users of Falco should be aware that they might need to add their own reliability settings. It's a classic case where perhaps the focus was on the security functionality of the app, and the chart packaging received less attention to reliability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; Deploys as a DaemonSet by default; PDB/TopologySpread/HPA/Replicas are N/A.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Longhorn&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; Longhorn is a cloud-native distributed storage solution. A zero score here is concerning because storage systems are complex and not having (for example) a PodDisruptionBudget for the storage pods could lead to data unavailability during maintenance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; Mixed workloads (Deployments/StatefulSets/DaemonSets). DS components won't use PDB/Spread/HPA.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Fluent Bit&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; Fluent Bit is a log forwarding agent. Its chart's score of 0 reflects that it runs as a DaemonSet with no added frills. Logging pipelines are critical for SRE observability, so keeping them reliable is important.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note&lt;/strong&gt;: Deploys as a DaemonSet by default; PDB/TopologySpread/HPA/Replicas are N/A.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Calico (Tigera Operator)&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; Calico is a networking CNI for Kubernetes, and it often runs via an operator.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Grafana Alloy&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; Grafana Alloy is Grafana's telemetry collector (the successor to Grafana Agent). Its scoring 0 again emphasizes that even within a known vendor's ecosystem, not every chart is equal - Loki's was excellent, but this one was not.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; Typically runs as a DaemonSet for node telemetry; apply DS caveats (PDB/TopologySpread/HPA/Replicas are N/A).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Prometheus Node Exporter&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; The node-exporter chart (listed as part of prometheus-community) also had none of the reliability features. Node-exporter runs as a DaemonSet on each node to collect metrics. As with Fluent Bit, running as a DaemonSet might have led maintainers to leave out budgets or autoscaling (since those don't apply in the same way).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; Deploys as a DaemonSet by default; PDB/TopologySpread/HPA/Replicas are N/A.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Actions Runner Controller&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; This is a chart for a GitHub Actions self-hosted runner controller. Its score of 0 means it lacks any reliability config; as an operator-like component, it probably wasn't given PDBs or special priority.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;AMD GPU and Intel GPU plugins&lt;/strong&gt; - &lt;strong&gt;Score: 0.&lt;/strong&gt; We saw charts for GPU device plugins (for AMD and Intel GPUs) also with zero scores. These are deployed as DaemonSets to advertise GPUs to the cluster. They had no reliability features in charts, which might be because they're expected to be super lightweight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; Deploys as a DaemonSet by default; PDB/TopologySpread/HPA/Replicas are N/A.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In total, &lt;strong&gt;10 charts scored 0&lt;/strong&gt; (some of which we described above). Many others were only slightly above 0 (a score of 1 or 2).&lt;/p&gt;

&lt;p&gt;Common patterns among low-scoring charts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Many are infrastructure add-ons (operators, agents, plugins)&lt;/strong&gt; rather than end-user applications. It seems chart maintainers for these system-level tools often keep the chart minimal.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DaemonSet workloads --&lt;/strong&gt; Several 0‑score charts are DaemonSets (e.g., Fluent Bit, node‑exporter, GPU plugins). For DS, controls like PDB/TopologySpread/HPA/Replicas are generally not applicable; what matters is CPU/Memory requests and limits, Liveness probes, PriorityClass, and DS rollingUpdate settings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relatively newer or niche projects&lt;/strong&gt; - Some low performers are not as mature or widely used, perhaps, so their charts haven't undergone rigorous production-hardening by the community.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
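&lt;p&gt;To make the DaemonSet caveat concrete, here is a minimal sketch of the controls that &lt;em&gt;do&lt;/em&gt; apply to a DS. The names and values are illustrative, not taken from any audited chart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent                 # hypothetical node agent
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1           # limit disruption during rollouts
  selector:
    matchLabels: {app: log-agent}
  template:
    metadata:
      labels: {app: log-agent}
    spec:
      priorityClassName: system-node-critical   # protect the agent from eviction
      containers:
        - name: agent
          image: example/log-agent:1.0
          resources:
            requests: {cpu: 100m, memory: 128Mi}
            limits: {memory: 256Mi}
          livenessProbe:
            httpGet: {path: /healthz, port: 2020}
            initialDelaySeconds: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;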

&lt;p&gt;For users, the takeaway is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;if you are using one of these low-scoring charts (or any chart that hasn't clearly advertised its reliability features), do not deploy it blindly in production.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At a minimum, consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reviewing the chart's values carefully and enabling a PodDisruptionBudget and topology spread constraints wherever applicable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting resource requests/limits via values overrides.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Adding an HPA (if the app would benefit from scaling).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensuring liveness/readiness probes are attached (via chart values, or a post-render patch if not supported natively).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If it's a critical infrastructure component, assigning a PriorityClass to avoid eviction (for example, system-cluster-critical if appropriate, or a custom class).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
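&lt;p&gt;These overrides can usually be expressed in a single values file. A sketch, assuming the chart exposes common key names (real names vary per chart, so check its &lt;code&gt;values.yaml&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# my-values.yaml -- illustrative keys; real names depend on the chart
replicaCount: 2

podDisruptionBudget:
  enabled: true
  minAvailable: 1

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels: {app: my-app}

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80

resources:
  requests: {cpu: 250m, memory: 256Mi}
  limits: {memory: 512Mi}

priorityClassName: business-critical      # assumes this PriorityClass exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You would then apply it with something like &lt;code&gt;helm upgrade --install my-app example-repo/my-app -f my-values.yaml&lt;/code&gt;.&lt;/p&gt;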

&lt;p&gt;Basically, use the findings here as a checklist against any Helm chart you deploy: check if it has these items, and if not, you might need to supply them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxa19or0s8nbyh7qkl5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxa19or0s8nbyh7qkl5p.png" alt=" " width="800" height="563"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion and Recommendations&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This reliability audit of Helm charts has revealed a sizable gap between the reliability best practices that we &lt;em&gt;know&lt;/em&gt; are important and what is actually implemented in many Helm charts today. While a few charts exemplify excellence in configuration, the majority have significant room for improvement. At the same time, we acknowledge that different use cases require different safeguards.&lt;/p&gt;

&lt;p&gt;In closing, we summarize recommendations for both &lt;strong&gt;chart users (operators)&lt;/strong&gt; and &lt;strong&gt;Helm chart maintainers&lt;/strong&gt; on how to act on these insights, and how tools like Preq/Prequel can assist in catching unmitigated risks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Chart Maintainers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embed Reliability Best Practices by Default:&lt;/strong&gt; If you maintain a Helm chart, consider this an encouragement to bake in more of these features. These should be part of the "standard equipment" of your chart.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provide Autoscaling Options:&lt;/strong&gt; Where applicable, provide an HPA in the chart (it can be off by default, but ready to enable). This signals to users that your application supports scaling and encourages them to use it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Don't Skimp on Probes:&lt;/strong&gt; Ensure every long-running pod has liveness and readiness probes defined. This is one area where many charts did well, and all should; it significantly increases resiliency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set Resource Requests/Limits:&lt;/strong&gt; We recognize that picking default resource values can be tricky (since workloads differ), but providing reasonable defaults is better than none.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use PriorityClass for Critical Components:&lt;/strong&gt; If your chart is for a system-critical service (operators, controllers, ingress, etc.), consider assigning a high PriorityClass (or at least make it configurable).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test Disruption Scenarios:&lt;/strong&gt; As a maintainer, test how your chart behaves during common scenarios: node drains, upgrades (for stateful applications, does a helm upgrade restart all pods at once?), high load (does CPU spike, and if so, is an HPA there to help?), etc. This experiential testing can highlight missing pieces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leverage Community Standards:&lt;/strong&gt; The Kubernetes community (and projects like the CNCF) often provide guidelines or even boilerplate for these configurations. Following these guides can serve as a checklist for your chart.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document appropriately:&lt;/strong&gt; Clear and comprehensive documentation is essential for helping end-users configure reliability settings that fit their specific needs and operational constraints. Good documentation eases maintenance and ensures a smooth experience for end-users.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
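&lt;p&gt;The "off by default, ready to enable" pattern for autoscaling costs maintainers very little. A sketch of a conditional HPA template, with illustrative chart and value names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# templates/hpa.yaml -- rendered only when the user opts in
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "mychart.fullname" . }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "mychart.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
{{- end }}
---
# values.yaml defaults: off until the user enables it
autoscaling:
  enabled: false
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;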

&lt;p&gt;&lt;strong&gt;For Helm Chart Users (Deployers):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review Charts Before Production Use:&lt;/strong&gt; Do not assume a Helm chart is production-ready. As this audit shows, many are not, in terms of reliability. Before deploying, &lt;strong&gt;audit the chart's values and manifests&lt;/strong&gt; yourself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Override and Augment Configurations:&lt;/strong&gt; The beauty of Helm is you can supply custom values. Use this to your advantage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Consider writing an &lt;strong&gt;overlay chart or a wrapper&lt;/strong&gt; that adds missing pieces.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Note:&lt;/strong&gt; &lt;em&gt;overlays/wrappers break the "least knowledge" principle. You can't rely solely on upstream SemVer/upgrade notes. Treat it like a maintained fork.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Contribute Back Improvements:&lt;/strong&gt; If you as a user had to add reliability configs to make a chart stable, consider contributing that back to the chart's repository (submit a pull request or issue). Prefer upstreaming first; use overlays only when upstream changes are not feasible in the short term.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Detection Tools:&lt;/strong&gt; Consider using tools that can scan your cluster or manifests for missing best practices, essentially doing what this audit did, but for your environment. Tools like Prequel (start for free) can be integrated into CI pipelines or run against your Helm releases continuously to flag if, say, a new deployment is missing a liveness probe or PDB. In an enterprise setting, Prequel could be set up as a guardrail: whenever a new chart is deployed, it checks the CRE rules and alerts if something critical is absent. This kind of automation ensures that even if a maintainer hasn’t provided a feature, you catch it before it causes an incident.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
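&lt;p&gt;One way to build such an overlay is a thin local chart that pins the upstream chart as a dependency and carries your reliability overrides. A sketch with illustrative names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Chart.yaml of the wrapper chart
apiVersion: v2
name: my-app-wrapper
version: 0.1.0
dependencies:
  - name: upstream-app                      # the upstream chart's name
    version: 1.2.3                          # pin an exact upstream version
    repository: https://charts.example.com
---
# values.yaml -- overrides live under the dependency's name
upstream-app:
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  resources:
    requests: {cpu: 250m, memory: 256Mi}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As noted above, treat this wrapper as something you maintain: every upstream release now requires a version bump and a review on your side.&lt;/p&gt;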

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e4021e1cii5k2invfga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e4021e1cii5k2invfga.png" alt=" " width="800" height="1037"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Role of Continuous Reliability Scanning:&lt;/strong&gt; Finally, it's worth re-emphasizing the value of continuous monitoring using tools like Prequel. Just as security scanning of images and CVEs has become a standard part of DevOps, &lt;strong&gt;reliability scanning is emerging as a complementary practice&lt;/strong&gt;. By leveraging the &lt;a href="https://docs.prequel.dev/cres/commercial" rel="noopener noreferrer"&gt;CRE rule set&lt;/a&gt; (which encapsulates knowledge of failure patterns), teams can detect misconfigurations early.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.prequel.dev/sign-up" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u1wvkwir0oi5t00tkum.png" alt="CTA" width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The journey to reliable Kubernetes deployments is a shared responsibility between chart creators and users. Helm charts are a powerful vehicle for distributing applications, but they should carry not just the app itself, but also the wisdom of running it reliably. Our research is designed to bring awareness to the risk so that teams can fully embrace that philosophy. By implementing the recommendations above, we can collectively raise the reliability bar.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.prequel.dev/" rel="noopener noreferrer"&gt;[1] Prequel Documentation‍&lt;/a&gt;&lt;br&gt;
&lt;a href="https://dev.to/cloud-sky-ops/post-210-reliability-by-design-probes-poddisruptionbudgets-and-topology-spread-constraints-473d"&gt;[2] [3] [4] [7] Post 2/10 — Reliability by Design: Probes, PodDisruptionBudgets, and Topology Spread Constraints - DEV Community‍&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/" rel="noopener noreferrer"&gt;[5] Services, Load Balancing, and Networking‍&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/" rel="noopener noreferrer"&gt;[6] [8] Kubernetes Best Practices for Reliability‍&lt;/a&gt;&lt;br&gt;
&lt;a href="https://cloud.ibm.com/docs/openshift?topic=openshift-pod_priority" rel="noopener noreferrer"&gt;[9] IBM Cloud Docs‍&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" rel="noopener noreferrer"&gt;[10] Horizontal Pod Autoscaling | Kubernetes&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>reliability</category>
      <category>sre</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Community: The 100% Open-Source AI Stack That Automates My Business, and Tricks for Troubleshooting It</title>
      <dc:creator>Lyndon Brown</dc:creator>
      <pubDate>Wed, 08 Oct 2025 22:26:13 +0000</pubDate>
      <link>https://dev.to/lyndon_brown_/community-the-100-open-source-ai-stack-that-automates-my-business-and-tricks-for-troubleshooting-1dad</link>
      <guid>https://dev.to/lyndon_brown_/community-the-100-open-source-ai-stack-that-automates-my-business-and-tricks-for-troubleshooting-1dad</guid>
      <description>&lt;p&gt;Most teams don't need a full-fledged "AI platform" to automate everyday workflows. You need a reliable way to trigger automation jobs, call a model, keep a little context, and deliver results. That's it.&lt;/p&gt;

&lt;p&gt;In this guide I'll show my practical, open-source AI stack for automating my business workflows. For me, it was important for my stack to be 100% open source and self-hosted. There are certainly shortcuts you can take with commercial LLMs or other components.&lt;/p&gt;

&lt;p&gt;In short, I landed on n8n for orchestration, Ollama as the LLM server, and Postgres + pgvector for memory/RAG.&lt;/p&gt;

&lt;p&gt;I also developed a simple deployment playbook with k8s extensibility and tips for keeping it reliable and healthy as you go.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why an Open-Source stack
&lt;/h2&gt;

&lt;p&gt;If you're handling customer data, internal docs, or anything with compliance implications, running locally or on your own cloud removes a ton of uncertainty. With this stack, your prompts, retrieved snippets, and outputs live inside your network, not on somebody else's infrastructure. You can log exactly what you want, keep tokens from leaking, and reason about costs in traditional terms (CPU/RAM/disk) instead of foggy per-token math.&lt;/p&gt;

&lt;p&gt;There's also the control aspect. Open tools are swappable. Don't like the model? Pull a different one. Need to move from a laptop to on-prem? Bring your Kubernetes manifests and go. And because these projects are community-driven, you can read the code, file issues, or extend them when you bump into edge cases, no waiting for a vendor.&lt;/p&gt;

&lt;p&gt;Oh, yeah, and for me cost mattered here.&lt;/p&gt;

&lt;h2&gt;
  
  
  A simple open-source AI stack that just works
&lt;/h2&gt;

&lt;p&gt;At a high level, the flow is straightforward: a trigger fires in n8n (a webhook, a cron, or an event); you fetch context from APIs or files; you optionally search prior knowledge in Postgres + pgvector; you call the model via Ollama; then you save results and push them to Slack, email, the CRM, or a dashboard. That single loop covers a surprising amount of basic business flows like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sending a daily team update&lt;/li&gt;
&lt;li&gt;Sorting and replying to customer support requests&lt;/li&gt;
&lt;li&gt;Enriching lead details before they go into the CRM&lt;/li&gt;
&lt;li&gt;Turning meeting transcripts into notes, or&lt;/li&gt;
&lt;li&gt;Answering questions about company policies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why these three?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;n8n&lt;/strong&gt; gives you a visual, testable pipeline with retries, branching, and a good set of built-in nodes. You can keep 90% of the logic declarative and drop into a "Code" node only where it helps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; is one of the most popular low-friction platforms to serve local models: one command to pull, a tiny HTTP API to call. No GPU required for small/quantized models; if you do have one, it'll happily use it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Postgres + pgvector&lt;/strong&gt; turns your database into a memory store. You can keep workflow state, results, and embeddings in one place, with real indexing and backups, no extra vector service to run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Do you really need LangChain / AutoGPT / vLLM?
&lt;/h3&gt;

&lt;p&gt;These are great AI tools, but you probably don't need them to add value to your workflow on day one.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain&lt;/strong&gt; is great if you're building complex chains, tool routing, or evaluators as a codebase of their own. If your flows are mostly linear ("get data - retrieve context - prompt - send"), n8n's nodes plus a small Code step are usually simpler to reason about and maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGPT&lt;/strong&gt; (or other agent frameworks) is useful when the task is genuinely open-ended and needs autonomous planning ("research X, compare Y, produce Z unless blocked"). For business automations, most tasks are bounded: summarize, classify, extract, transform. Agentic loops can add latency and instability you don't need yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vLLM&lt;/strong&gt; is a fantastic serving stack when throughput, long context, or GPU batching are your constraints. If you're running a handful of concurrent automations, Ollama is much easier operationally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rule of thumb: begin with n8n + Ollama + pgvector. If a real bottleneck appears (too many concurrent requests, prompts that need long contexts, or tasks that require autonomous planning), layer on the specialized tool that solves that bottleneck and nothing else.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Workflow Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd15xqxs7tud8ae5seei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd15xqxs7tud8ae5seei.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In practice you'll add a few niceties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep prompts tidy&lt;/strong&gt;: write down a handful of prompt templates you actually use (for summaries, classifications, replies, etc.) and save them in Postgres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid duplicates&lt;/strong&gt;: if you run daily jobs, store a hash of the input. If you've already seen it, skip re-processing. This saves time and avoids sending the same Slack message or email twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log what happens&lt;/strong&gt;: after each step, record start time, end time, and status in the database. When something feels slow or fails silently, you'll have a clear history instead of guessing.&lt;/li&gt;
&lt;/ul&gt;
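&lt;p&gt;The dedupe idea above fits in a few lines. A minimal sketch in shell (the payload and the &lt;code&gt;seen_inputs&lt;/code&gt; table are made-up examples; in n8n the equivalent would live in a Code step):&lt;/p&gt;

```shell
# Hash an inbound payload so repeated daily runs can skip re-processing.
# The payload here is a made-up example; in n8n this would be the webhook body.
payload='{"id": 42, "subject": "Daily digest"}'
hash=$(printf '%s' "$payload" | sha256sum | cut -d ' ' -f 1)
echo "$hash"

# Then record it before processing, e.g. (table is hypothetical):
#   psql -h localhost -U aiuser -d ai \
#     -c "INSERT INTO seen_inputs (hash) VALUES ('$hash') ON CONFLICT DO NOTHING;"
# If the INSERT hits a conflict, you've already handled this input: skip it.
```

&lt;p&gt;Hashing the raw input (rather than a timestamp or ID) also catches the case where the same content arrives under a different identifier.&lt;/p&gt;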

&lt;h2&gt;
  
  
  Run it locally (Docker Quick Start)
&lt;/h2&gt;

&lt;p&gt;You can run the stack locally in about 10 minutes with Docker. We'll run everything in containers: Postgres + pgvector, n8n, and Ollama. That way n8n can call the model at &lt;a href="http://ollama:11434" rel="noopener noreferrer"&gt;http://ollama:11434&lt;/a&gt; on the internal Docker network.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker Desktop (or Docker Engine) with Compose v2&lt;/li&gt;
&lt;li&gt;Optional: psql client for quick DB checks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1) Project layout
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ai-stack/&lt;span class="o"&gt;{&lt;/span&gt;pg-data,n8n-data,init&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-stack
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create two files:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;docker-compose.yml&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgvector/pgvector:pg16&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aiuser&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;supersecret&lt;/span&gt;
      &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432:5432"&lt;/span&gt;            &lt;span class="c1"&gt;# change left side if 5432 is busy on host&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./pg-data:/var/lib/postgresql/data&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./init:/docker-entrypoint-initdb.d&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD-SHELL"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pg_isready&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-U&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;aiuser&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-d&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ai"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="na"&gt;n8n&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;n8nio/n8n:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5678:5678"&lt;/span&gt;            &lt;span class="c1"&gt;# change left side if 5678 is busy on host&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_TYPE=postgresdb&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_POSTGRESDB_HOST=postgres&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_POSTGRESDB_PORT=5432&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_POSTGRESDB_DATABASE=n8n&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_POSTGRESDB_USER=aiuser&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;DB_POSTGRESDB_PASSWORD=supersecret&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;N8N_BASIC_AUTH_ACTIVE=true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;N8N_BASIC_AUTH_USER=admin@example.com&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;N8N_BASIC_AUTH_PASSWORD=changeme&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;N8N_DIAGNOSTICS_ENABLED=false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;N8N_RUNNERS_ENABLED=true&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;condition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service_healthy&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./n8n-data:/home/node/.n8n&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;

  &lt;span class="na"&gt;ollama&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama/ollama:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;11434:11434"&lt;/span&gt;          &lt;span class="c1"&gt;# change left side if 11434 is busy on host&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./models:/root/.ollama&lt;/span&gt; &lt;span class="c1"&gt;# cache models on your disk&lt;/span&gt;
    &lt;span class="na"&gt;healthcheck&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CMD"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-sf&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434/api/tags&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;||&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
      &lt;span class="na"&gt;retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;init/00-init.sql&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- DB for n8n to store workflows/executions&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;n8n&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;OWNER&lt;/span&gt; &lt;span class="n"&gt;aiuser&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable vector in 'ai' DB and a tiny log table&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;connect&lt;/span&gt; &lt;span class="n"&gt;ai&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;llm_runs&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prompt&lt;/span&gt;     &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;response&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;latency_ms&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
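&lt;p&gt;Once the extension is enabled, you can add a retrieval table alongside &lt;code&gt;llm_runs&lt;/code&gt;. This is a sketch, not part of the init script above: the table name, columns, and the 768 dimension are placeholders, and the dimension must match your embedding model.&lt;/p&gt;

```sql
-- Hypothetical retrieval table; match the dimension to your embedding model.
CREATE TABLE IF NOT EXISTS docs (
  id        BIGSERIAL PRIMARY KEY,
  body      TEXT NOT NULL,
  embedding vector(768)
);

-- An approximate-nearest-neighbor index keeps top-k searches fast as data grows.
CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
  ON docs USING hnsw (embedding vector_cosine_ops);
```

&lt;p&gt;With cosine ops, a top-k query looks like &lt;code&gt;ORDER BY embedding &amp;lt;=&amp;gt; $1 LIMIT 5&lt;/code&gt;.&lt;/p&gt;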



&lt;p&gt;Bring the stack up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2) Pull a small model (inside the Ollama container)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; ollama ollama pull llama3.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sanity checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;list models (in container)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; ollama curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:11434/api/tags | jq &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;check pgvector is enabled
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-h&lt;/span&gt; localhost &lt;span class="nt"&gt;-U&lt;/span&gt; aiuser &lt;span class="nt"&gt;-d&lt;/span&gt; ai &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT extname FROM pg_extension;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open n8n at &lt;a href="http://localhost:5678" rel="noopener noreferrer"&gt;http://localhost:5678&lt;/a&gt;. On first visit, you'll see a basic-auth prompt (&lt;a href="mailto:admin@example.com"&gt;admin@example.com&lt;/a&gt; / changeme), then n8n will ask you to create the Owner account (email/password).&lt;/p&gt;

&lt;h2&gt;
  
  
  Reliability Issues: what breaks and how to stay ahead of it
&lt;/h2&gt;

&lt;p&gt;Self-hosting open-source projects comes with its own challenges: you are on the hook to keep everything up and running, and there is no support team to call.&lt;/p&gt;

&lt;p&gt;Here are some of the gotchas I ran into and how I handle them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool reliability gotchas &amp;amp; fixes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Issues &amp;amp; Fixes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Long LLM calls can hit step timeouts; split work into smaller steps and add retries with jitter.&lt;br&gt;- Webhooks may drop large payloads; raise body-size limits at the ingress/reverse proxy.&lt;br&gt;- Retries without persistent storage can lose inputs; persist inbound payloads to Postgres first, then process.&lt;br&gt;- Monitor execution latency and failure rates; set per-node timeouts sensibly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Postgres + pgvector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- Queries fail because pgvector isn't enabled; run &lt;code&gt;CREATE EXTENSION vector;&lt;/code&gt; once per database.&lt;br&gt;- Vector searches feel slow; add a vector index (IVFFlat or HNSW), keep your top-k small (say 5–20), and do routine VACUUM/ANALYZE.&lt;br&gt;- Too many connections cause random timeouts; add PgBouncer in front of Postgres and keep n8n's pool small.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;- First-run model pulls are slow/flaky; pre-pull models on deploy and send a tiny warm-up prompt after start.&lt;br&gt;- Runs out of memory on laptops/containers; use smaller/quantized models, keep prompts short, and give Docker enough RAM/swap.&lt;br&gt;- Port/DNS weirdness; remap the host port in the Compose file.&lt;br&gt;- "Model not found" errors; check &lt;code&gt;/api/tags&lt;/code&gt; to see what's loaded, and run &lt;code&gt;ollama pull &amp;lt;model&amp;gt;&lt;/code&gt; on container startup.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
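&lt;p&gt;The "retries with jitter" fix deserves a concrete shape. A sketch of an exponential-backoff schedule (the base, cap, and jitter values are illustrative; in n8n this logic would live in a Code step or the node's retry settings):&lt;/p&gt;

```shell
# Exponential backoff: 500ms, 1s, 2s, 4s, then capped at 8s,
# plus up to 250ms of random jitter so retries don't synchronize.
backoff_ms() {
  attempt=$1
  base=$(( 500 * (1 << attempt) ))
  [ "$base" -gt 8000 ] && base=8000
  jitter=$(awk 'BEGIN { srand(); print int(rand() * 250) }')
  echo $(( base + jitter ))
}

for i in 0 1 2 3 4 5; do
  printf 'attempt %d: wait %s ms\n' "$i" "$(backoff_ms "$i")"
done
```

&lt;p&gt;The jitter matters more than the exact curve: without it, every failed workflow retries at the same instant and hammers the service that just recovered.&lt;/p&gt;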

&lt;h2&gt;
  
  
  Detect issues early with community CREs (and preq)
&lt;/h2&gt;

&lt;p&gt;The reliability community maintains a community-driven catalog of failure patterns called CREs (Common Reliability Enumerations) so you don't have to rediscover them in production. Each CRE includes detection as code.&lt;/p&gt;

&lt;p&gt;You can run these with preq (pronounced "preek"), the open-source reliability problem detector, to turn noisy logs into clear, actionable signals.&lt;/p&gt;

&lt;p&gt;The public catalog covers dozens of popular technologies, from Kubernetes and databases to application runtimes. Here are a few that are relevant to this AI stack or similar setups:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CRE&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;What it detects&lt;/th&gt;
&lt;th&gt;Quick mitigation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CRE-2025-0179&lt;/td&gt;
&lt;td&gt;n8n&lt;/td&gt;
&lt;td&gt;Items disappear between nodes in long workflows, causing silent data loss and incomplete runs.&lt;/td&gt;
&lt;td&gt;Add item-count checks, enable detailed logging, split long workflows, and use error workflows to capture failures.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRE-2025-0200&lt;/td&gt;
&lt;td&gt;AutoGPT&lt;/td&gt;
&lt;td&gt;AutoGPT loops while debugging itself, consuming tokens and memory until it crashes.&lt;/td&gt;
&lt;td&gt;Detect loops, cap depth and tokens, add circuit breakers, and auto-stop runaway agents.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRE-2025-0137&lt;/td&gt;
&lt;td&gt;Kubernetes Runtime&lt;/td&gt;
&lt;td&gt;A pod exceeds its memory limit and is killed, often showing up as CrashLoopBackOff.&lt;/td&gt;
&lt;td&gt;Increase memory limits, profile for leaks, tune heap/GC, use VPA, and add memory alerts.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRE-2025-0071&lt;/td&gt;
&lt;td&gt;Kubernetes&lt;/td&gt;
&lt;td&gt;CoreDNS has no healthy pods, leading to cluster-wide DNS failures.&lt;/td&gt;
&lt;td&gt;Check pods and logs, scale replicas, restart rollout, and confirm kube-dns endpoints.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRE-2025-0077&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;Postgres cannot extend files because the disk is full, blocking new writes.&lt;/td&gt;
&lt;td&gt;Free space, clean old data, run vacuum, expand disk size, and set up disk usage alerts.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You can wire detections to Slack, email, or even Jira with a short runbook ("what it means" and "what to do") to make fixes faster.&lt;/p&gt;

&lt;p&gt;If you are setting up this stack, or your own, set aside 10 minutes to &lt;a href="https://docs.prequel.dev/install" rel="noopener noreferrer"&gt;download preq and run it&lt;/a&gt;. It's open source, and it will save you from the dreaded "why did this silently fail?" mornings. Begin with the n8n data-loss rule, a basic Kubernetes exit code/DNS rule if you are running clusters, and a Postgres disk or connections check. You can always add more as your workflows grow.&lt;/p&gt;

&lt;p&gt;👉 Explore the full &lt;a href="https://github.com/prequel-dev/cre" rel="noopener noreferrer"&gt;CRE catalog&lt;/a&gt;; try out &lt;a href="https://github.com/prequel-dev/preq" rel="noopener noreferrer"&gt;preq&lt;/a&gt;, and if you find it useful, don't forget to ⭐ the repo to support the community. : )&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;n8n – docs: &lt;a href="https://n8n.io/docs" rel="noopener noreferrer"&gt;https://n8n.io/docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Ollama – repo: &lt;a href="https://github.com/ollama/ollama" rel="noopener noreferrer"&gt;https://github.com/ollama/ollama&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Postgres + pgvector – &lt;a href="https://github.com/pgvector/pgvector" rel="noopener noreferrer"&gt;https://github.com/pgvector/pgvector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes – &lt;a href="https://kubernetes.io/docs/home/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/home/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CREs (Common Reliability Enumerations) – &lt;a href="https://github.com/prequel-dev/cre" rel="noopener noreferrer"&gt;https://github.com/prequel-dev/cre&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Preq (run CREs) – &lt;a href="https://github.com/prequel-dev/preq" rel="noopener noreferrer"&gt;https://github.com/prequel-dev/preq&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>automation</category>
    </item>
    <item>
      <title>How I find and fix Kubernetes Exit Codes and Misconfigurations for free</title>
      <dc:creator>Lyndon Brown</dc:creator>
      <pubDate>Fri, 12 Sep 2025 14:37:15 +0000</pubDate>
      <link>https://dev.to/lyndon_brown_/how-i-find-and-fix-kubernetes-exit-codes-and-misconfigurations-for-free-18i</link>
      <guid>https://dev.to/lyndon_brown_/how-i-find-and-fix-kubernetes-exit-codes-and-misconfigurations-for-free-18i</guid>
      <description>&lt;p&gt;Kubernetes is powerful, but troubleshooting issues in a live cluster can be painful. In a complex deployment, critical warning signs often hide in thousands of log lines and events. What if we could surface these reliability issues before they take applications down?&lt;/p&gt;

&lt;p&gt;Preq (pronounced "preek") is an open-source tool that brings a proactive approach to Kubernetes troubleshooting. It is a reliability problem detector that checks your cluster's logs, events, and configurations against a community-driven catalog of failure patterns [1]. Using Preq, you can monitor your cluster and catch misconfigurations, anti-patterns, or bugs early, instead of discovering them during a 2 AM incident [1].&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing preq via Krew
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.prequel.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;preq&lt;/strong&gt;&lt;/a&gt; is distributed as a kubectl plugin, making it easy to install through the Kubernetes Krew plugin manager. First, ensure you have Krew set up (if not, install it from the &lt;a href="https://krew.sigs.k8s.io/docs/user-guide/setup/install/" rel="noopener noreferrer"&gt;official docs&lt;/a&gt;). Then install Preq with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;preq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Within seconds, the plugin is ready to use[1]. There's no extra configuration needed. Preq ships with the latest common reliability enumeration (CRE) rule packages baked in. It auto-updates its rules so you're always scanning for the newest issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running kubectl preq from the CLI
&lt;/h2&gt;

&lt;p&gt;Once installed, you can run Preq directly via kubectl to check various Kubernetes resources and their logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods:&lt;/strong&gt; Scan an individual pod's logs and related events. For example, &lt;code&gt;kubectl preq my-pod-abc123&lt;/code&gt; will fetch that pod's logs and events, then compare them against the CRE rule library. [1]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Services:&lt;/strong&gt; Running &lt;code&gt;kubectl preq service/my-service&lt;/code&gt; triggers Preq to assess the pods behind that Service. While Services themselves don't have logs, Preq will identify the endpoints/pods for the service and check their logs and events for known issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jobs and CronJobs:&lt;/strong&gt; Run Preq on a Job or on pods created by a CronJob to inspect execution logs and events[12].&lt;/p&gt;

&lt;p&gt;Under the hood, the Preq plugin uses Kubernetes APIs. This means you can run Preq on any resource type that has associated logs or events, giving you a flexible "detective" for your cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Preq with ConfigMaps and Events
&lt;/h2&gt;

&lt;p&gt;The current release of Preq primarily targets logs and manifests, but you can also leverage it for configuration files and cluster events with a little creativity.&lt;/p&gt;

&lt;h3&gt;
  
  
  ConfigMaps
&lt;/h3&gt;

&lt;p&gt;Directly scan a ConfigMap with the plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl preq &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt; configmap/&amp;lt;name-of-config-map&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes events
&lt;/h3&gt;

&lt;p&gt;Use this feeder to stream a timestamp and the raw event into Preq:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.items[] | "\(.metadata.creationTimestamp) \(tojson)"'&lt;/span&gt; | kubectl preq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Other workload configurations beyond ConfigMaps
&lt;/h3&gt;

&lt;p&gt;Use this workaround to feed deployments and similar manifests as compact JSON, stamped with a single UTC timestamp per line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get deploy &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"1s/^/&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +&lt;span class="s2"&gt;"%Y-%m-%dT%H:%M:%SZ"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; /"&lt;/span&gt; | kubectl preq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Example CREs
&lt;/h2&gt;

&lt;p&gt;We'll highlight a few Common Reliability Enumerations created by community members[2][3][4][5]:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;CRE&lt;/th&gt;
&lt;th&gt;What breaks&lt;/th&gt;
&lt;th&gt;Signals you will see&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0119" rel="noopener noreferrer"&gt;CRE-2025-0119&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Too many pods down during an update&lt;/td&gt;
&lt;td&gt;Rollout stalls, unavailable replicas, PDB budget exceeded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0071" rel="noopener noreferrer"&gt;CRE-2025-0071&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Cluster DNS resolution fails when CoreDNS has no ready pods or endpoints&lt;/td&gt;
&lt;td&gt;CoreDNS availableReplicas at zero, kube dns endpoints empty, pods in CrashLoopBackOff, CoreDNS logs show errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0048" rel="noopener noreferrer"&gt;CRE-2025-0048&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Worker node enters NotReady because control plane cannot resolve the node's FQDN&lt;/td&gt;
&lt;td&gt;Node status shows NotReady without resource pressure, control plane logs may show hostname resolution errors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0125" rel="noopener noreferrer"&gt;CRE-2025-0125&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Kubelet crashes under rapid pod launches causing node NotReady and full node level outage with pod evictions&lt;/td&gt;
&lt;td&gt;Node NotReady, mass pod evictions and rescheduling, kubelet logs show panic in EventedPLEG evented.go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Exit Code CREs: Crashes and Their Causes (137, 127, 134, 139)
&lt;/h2&gt;

&lt;p&gt;Now let's talk about a different kind of problem: when your containers keep exiting with mysterious status codes. Preq includes CRE rules [7] for common exit codes to help pinpoint why a container crashed. Let's break down the usual suspects:&lt;/p&gt;

&lt;h3&gt;
  
  
  Exit Code 137
&lt;/h3&gt;

&lt;p&gt;This typically means the process was killed with SIGKILL, which in Kubernetes often implies an out-of-memory kill. In other words, the container used more memory than allowed, so the OS OOM killer terminated it [6][7]. It can also happen if someone manually runs &lt;code&gt;kill -9&lt;/code&gt; on the process, but OOM is the usual cause. In Kubernetes you'll often see "Reason: OOMKilled" in the pod's status if this is the case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Your app exceeded its memory limit or the node ran out of memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do:&lt;/strong&gt; Check the container's memory limits and usage. You can run &lt;code&gt;kubectl top pod&lt;/code&gt; to see if it was using a lot of memory. Increase the memory limit (or request) for the container to prevent the OOMKill, or optimize the application to use less memory. Preq can help by flagging frequent OOM kills so you know to take action before it impacts users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exit Code 127
&lt;/h3&gt;

&lt;p&gt;This means "command not found". The process tries to execute a file or command that doesn't exist in the container's filesystem [8][9]. It's a common error when the container's start command or entrypoint is misconfigured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Either the binary isn't installed, the path is wrong, or a dependency is missing. It can also be a shell quoting issue or a file permission problem, but usually it's a missing executable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do:&lt;/strong&gt; Describe the pod (&lt;code&gt;kubectl describe pod&lt;/code&gt;); Kubernetes will often log a "command not found" message in the events. Fix the command in your container spec or Dockerfile, and make sure the image has the expected program at the correct location. Preq can catch this by scanning events or termination messages for "exited with code 127" and common error text. The solution is usually straightforward: install the missing tool or correct the command path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exit Code 134
&lt;/h3&gt;

&lt;p&gt;This indicates the process received SIGABRT (abort signal) [10]. In plainer terms, the application crashed itself, often due to an internal error like an assertion failure or a call to abort(), or it was terminated after a fatal error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; Common causes include bugs (such as a failed assertion, or invalid memory access that the runtime catches and turns into an abort) or memory corruption detected by the runtime; glibc, for example, aborts on detected heap corruption. Hitting a resource limit can also trigger an abort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do:&lt;/strong&gt; Check the container's logs for any error messages or stack traces right before it exited. Often you'll see a line about an assertion or fatal error. Preq will highlight the occurrence of a 134 exit and can point out if it's a known pattern. Ensure you're not hitting known bugs in the app version, and consider adding liveness probes if a container aborts frequently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exit Code 139
&lt;/h3&gt;

&lt;p&gt;This is the infamous segmentation fault (SIGSEGV)[10]. The process tried to access memory it shouldn't (invalid pointer, buffer overflow, etc.), and the OS killed it. This is almost always a bug in the application code (or a library it's using).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it happens:&lt;/strong&gt; A segfault can be caused by many things: using a null pointer, reading/writing out of bounds, incompatible native libraries, etc. In some cases, even running out of stack can cause a segfault.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do:&lt;/strong&gt; As with 134, the primary action is to check application logs or enable core dumps for debugging. If the segfault happens on startup, it could be an incompatibility (for example, wrong CPU architecture or missing dependencies causing a segfault). Ensure the image is built for the correct architecture. Preq's rule for 139 will alert you that a container hit SIGSEGV. It can't fix the code, but it ensures you notice the crash.&lt;/p&gt;
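&lt;p&gt;A rule of thumb that covers both 134 and 139: exit codes above 128 mean the process was killed by a signal, and the signal number is the exit code minus 128. A quick sketch:&lt;/p&gt;

```shell
# Exit codes > 128 encode "killed by signal (code - 128)":
# 134 - 128 = 6 (SIGABRT), 139 - 128 = 11 (SIGSEGV)
for code in 134 139; do
  sig=$((code - 128))
  echo "exit $code -> signal $sig ($(kill -l "$sig"))"
done
# e.g. "exit 134 -> signal 6 (ABRT)"
```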

&lt;h3&gt;
  
  
  Detecting exit codes with a unified command
&lt;/h3&gt;

&lt;p&gt;You can quickly scan your cluster for pods that terminated with these exit codes by extracting the raw termination records with &lt;code&gt;kubectl&lt;/code&gt; and &lt;code&gt;jq&lt;/code&gt; and piping them into Preq:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'
.items[] as $p
| [ ($p.status.containerStatuses // []),
($p.status.initContainerStatuses // []),
($p.status.ephemeralContainerStatuses // []) ]
| add
| .[]
| (.lastState.terminated // .state.terminated) as $t
| select($t != null and $t.exitCode != null and $t.finishedAt != null)
| [ $t.finishedAt,
($p.metadata.namespace + "/" + $p.metadata.name),
.name,
($t.reason // ""),
($t.exitCode|tostring) ]
| @TSV'&lt;/span&gt; | preq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion: Using Preq in Your Daily Workflow
&lt;/h2&gt;

&lt;p&gt;In an ideal world, you catch problems before they cause downtime, and that's where Preq shines. Adopting &lt;code&gt;preq&lt;/code&gt; in your day-to-day Kubernetes workflows can significantly reduce mean time to detection for issues:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD Integration:&lt;/strong&gt; Consider running Preq as a post-deploy check in your continuous deployment pipeline. For example, after deploying a new version of an application, have a step that runs &lt;code&gt;kubectl preq&lt;/code&gt; on that namespace or on the specific new pods.&lt;/p&gt;
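&lt;p&gt;One way to structure that step (a sketch, not official Preq tooling): wrap the call in a small gate helper so any non-zero exit from &lt;code&gt;kubectl preq&lt;/code&gt; fails the stage. The pod name shown in the comment is hypothetical:&lt;/p&gt;

```shell
# Generic gate helper: run a check command; fail the stage if it fails.
gate() {
  if "$@"; then
    echo "gate passed"
  else
    echo "gate failed; blocking promotion" >&2
    return 1
  fi
}

# In a real pipeline step this would be something like:
#   gate kubectl preq my-app-6c9d5b7f4-abcde
gate true   # stand-in for a clean preq run; prints "gate passed"
```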

&lt;p&gt;&lt;strong&gt;Proactive scheduled runs:&lt;/strong&gt; Use &lt;code&gt;kubectl preq -j&lt;/code&gt; to generate a Kubernetes CronJob template. It writes &lt;code&gt;cronjob.yaml&lt;/code&gt;. Open the file, set the schedule, add the Preq command you want to run (including any &lt;code&gt;-a&lt;/code&gt; action and &lt;code&gt;-o&lt;/code&gt; output flags), set the namespace, then apply it with &lt;code&gt;kubectl apply -f cronjob.yaml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;kubectl preq&lt;/code&gt; does not support an all-namespaces flag. To scan many targets, pass a data source template to Preq or wrap multiple invocations in a small script that the CronJob runs.&lt;/p&gt;
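&lt;p&gt;A minimal sketch of such a wrapper script (the namespace list is an assumption, and the &lt;code&gt;-n&lt;/code&gt; flag assumes the plugin honors kubectl's usual namespace flag; check &lt;code&gt;kubectl preq --help&lt;/code&gt; for the exact options):&lt;/p&gt;

```shell
# Scan a list of namespaces one at a time, since there is no
# all-namespaces mode. PREQ_CMD can be overridden for dry runs.
scan_namespaces() {
  cmd="${PREQ_CMD:-kubectl preq}"
  for ns in "$@"; do
    echo "scanning namespace: $ns"
    # assumes the plugin accepts kubectl's usual -n namespace flag
    $cmd -n "$ns" || echo "preq flagged issues in $ns" >&2
  done
}

# In the CronJob container you might call:
#   scan_namespaces team-a team-b
```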

&lt;p&gt;&lt;strong&gt;Post-mortem and Continuous Improvement:&lt;/strong&gt; After any incident or outage, consider writing a new CRE rule (and contributing it!) if it was a novel issue. Preq's framework lets you codify that knowledge so that neither you nor anyone else gets bitten by the same problem twice.&lt;/p&gt;

&lt;p&gt;In summary, Preq is a powerful ally for Kubernetes users. It turns the wealth of community experience with failure modes into actionable insights you can run on-demand. By incorporating Preq into CI/CD pipelines, scheduled scans, and troubleshooting sessions, you can proactively detect and resolve issues – often before they turn into user-facing incidents. Happy monitoring, and may your clusters run clean and healthy!&lt;/p&gt;

&lt;p&gt;If you're looking for enterprise features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a distributed detection engine that runs across many nodes and clusters&lt;/li&gt;
&lt;li&gt;a web UI with guided workflows for investigation and collaboration&lt;/li&gt;
&lt;li&gt;deeper integrations (for incident tracking, etc.)&lt;/li&gt;
&lt;li&gt;a control plane for managing the distributed engine&lt;/li&gt;
&lt;li&gt;a larger, proprietary set of CRE rules maintained by the Prequel Reliability Research team (PRRT).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out &lt;a href="https://www.prequel.dev/sign-up" rel="noopener noreferrer"&gt;Prequel&lt;/a&gt;, our commercial offering, and let us know what you think!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] &lt;a href="https://www.prequel.dev/blog-post/dev-to-10-kubectl-plugins-that-help-make-you-the-most-valuable-kubernetes-engineer-in-the-room" rel="noopener noreferrer"&gt;Dev.to: 10 kubectl Plugins That Help Make You the Most Valuable Kubernetes Engineer in the Room&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[2] &lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0119" rel="noopener noreferrer"&gt;CRE-2025-0119 | Prequel&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[3] &lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0071" rel="noopener noreferrer"&gt;CRE-2025-0071 | Prequel&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[4] &lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0048" rel="noopener noreferrer"&gt;CRE-2025-0048 | Prequel&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[5] &lt;a href="https://docs.prequel.dev/cres/public/cre-2025-0125" rel="noopener noreferrer"&gt;CRE-2025-0125 | Prequel&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[6] &lt;a href="https://stackoverflow.com/questions/59729917/kubernetes-pods-terminated-exit-code-137?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;Stack Overflow: Kubernetes Pods Terminated Exit Code 137&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[7] &lt;a href="https://github.com/prequel-dev/cre/pull/137" rel="noopener noreferrer"&gt;Exit Code CREs | Prequel&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[8] &lt;a href="https://krew.sigs.k8s.io/docs/user-guide/setup/install/" rel="noopener noreferrer"&gt;Installing Krew&lt;/a&gt;&lt;br&gt;&lt;br&gt;
[9] &lt;a href="https://docs.prequel.dev/running#scheduled-jobs" rel="noopener noreferrer"&gt;Schedule preq to run in a Cronjob&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Bitnami’s Free Catalog Says Goodbye: Avoid Brownouts and a $72k Surprise</title>
      <dc:creator>Lyndon Brown</dc:creator>
      <pubDate>Wed, 10 Sep 2025 18:01:10 +0000</pubDate>
      <link>https://dev.to/lyndon_brown_/bitnamis-free-catalog-says-goodbye-avoid-brownouts-and-a-72k-surprise-4kp3</link>
      <guid>https://dev.to/lyndon_brown_/bitnamis-free-catalog-says-goodbye-avoid-brownouts-and-a-72k-surprise-4kp3</guid>
      <description>&lt;h2&gt;
  
  
  tldr
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bitnami is narrowing public access to images and pausing updates to many chart artifacts. Expect brownouts as the cut-over window starts on Aug 28th, with final public catalog deletion on Sept 29th&lt;/li&gt;
&lt;li&gt;The biggest risks:

&lt;ul&gt;
&lt;li&gt;Kubernetes ImagePullBackOff on restarts or during autoscaling,&lt;/li&gt;
&lt;li&gt;Stale/unpatched images (CVE drift),&lt;/li&gt;
&lt;li&gt;Chart drift and subchart dependencies that break upgrades.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;We're publishing &lt;strong&gt;CREs&lt;/strong&gt; (Common Reliability Enumerations) that help you quickly identify Bitnami-related risks and resolve them.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoaratzppr63ecgrq2pi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoaratzppr63ecgrq2pi.png" alt="reddit thread" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Credit: stonesabe4 on &lt;a href="https://www.reddit.com/r/kubernetes/comments/1n2gc8d/basically_just_found_out_i_need_to_72k_for/" rel="noopener noreferrer"&gt;reddit&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why Bitnami mattered
&lt;/h2&gt;

&lt;p&gt;For years, Bitnami's images and Helm charts were the de-facto path to running popular apps on Kubernetes. Well-maintained images, sensible defaults, and easy Helm installs. Many teams pinned Bitnami images in deployments, CI pipelines, and internal charts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's changing
&lt;/h2&gt;

&lt;p&gt;Bitnami is making a number of changes following their acquisition by Broadcom and &lt;a href="https://github.com/bitnami/containers/issues/83267" rel="noopener noreferrer"&gt;renewed focus on a subscription model&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Catalog changes
&lt;/h4&gt;

&lt;p&gt;The container repos are undergoing a major shift:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The existing &lt;code&gt;docker.io/bitnami&lt;/code&gt; public repo will be deleted&lt;/li&gt;
&lt;li&gt;A new repo &lt;code&gt;docker.io/bitnamisecure&lt;/code&gt; will contain hardened community images, but there is a catch: it will only contain &lt;code&gt;latest&lt;/code&gt; tags, and these images are intended for development use only&lt;/li&gt;
&lt;li&gt;Existing container images will be moved to a new repo &lt;code&gt;docker.io/bitnamilegacy&lt;/code&gt;, but will receive no further updates&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Charts stop updating
&lt;/h4&gt;

&lt;p&gt;Bitnami's pre-built Helm chart artifacts won't be updated anymore, so their defaults will keep pointing to old images; you'll need to override image repos/tags or adopt alternatives.&lt;/p&gt;

&lt;h4&gt;
  
  
  Brownouts &amp;amp; cutoff windows
&lt;/h4&gt;

&lt;p&gt;Bitnami has planned 24-hour outages for selected images. For each scheduled brownout, ten container images from &lt;code&gt;docker.io/bitnami&lt;/code&gt; will be taken offline for a 24-hour period. The specific applications impacted will be shared on the day the brownout begins. Final cutoff will occur on Sept 29.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6bm16ctp1yt9og63u2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6bm16ctp1yt9og63u2c.png" alt="bitnami timeline" width="800" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Bitnami Repo Deprecation Timeline&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Who's affected
&lt;/h2&gt;

&lt;p&gt;If you use any of these, read on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pinned or Unpinned Bitnami image tags&lt;/strong&gt; (e.g., &lt;code&gt;docker.io/bitnami/postgresql:13.x&lt;/code&gt;, &lt;code&gt;:latest&lt;/code&gt;) in Deployments/StatefulSets/Jobs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bitnami-based charts&lt;/strong&gt; in helmfile/Argo CD/Flux pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI pipelines that pull Bitnami tools&lt;/strong&gt; (kubectl, helm, db images, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What will the impact be?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes ErrImagePull / ImagePullBackOff&lt;/strong&gt; on pod restarts, scale-outs, node drains, or fresh deploys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time-bomb restarts&lt;/strong&gt; - Running pods look fine until the next pull (then fail)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security drift&lt;/strong&gt; - Stale/archived images stop receiving fixes and lead to accumulated CVEs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chart drift&lt;/strong&gt; - Defaults reference repos/tags that no longer update leading to failed upgrades or silent divergence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Doing a manual impact assessment
&lt;/h2&gt;

&lt;p&gt;Here are a few steps you can take to understand your exposure and mitigate associated risk:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Inventory images:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'..|.image? // empty'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; bitnami
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search configs &amp;amp; charts:&lt;/strong&gt; grep your helmfiles/values/overlays for bitnami and pinned tags&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
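&lt;p&gt;For step 2, a grep along these lines works (the repo paths in the comment are illustrative); it's demonstrated here against an inline values snippet so you can see what a hit looks like:&lt;/p&gt;

```shell
# Real usage would be something like:
#   grep -rniE 'docker\.io/bitnami' charts/ values/ overlays/
# Demonstrated against an inline snippet:
grep -nE 'docker\.io/bitnami' <<'EOF'
image:
  repository: docker.io/bitnami/postgresql
  tag: "13.4.0"
EOF
# prints: 2:  repository: docker.io/bitnami/postgresql
```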

&lt;h2&gt;
  
  
  Automated Assessment: New CREs to help
&lt;/h2&gt;

&lt;p&gt;We're publishing a focused set of Common Reliability Enumerations (CREs) to help you surface issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PREQUEL-2025-0102&lt;/strong&gt; &lt;strong&gt;(Pulling Deprecated Bitnami Images)&lt;/strong&gt; - Detects workloads pulling Bitnami images scheduled to be deleted or moved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PREQUEL-2025-0103&lt;/strong&gt; &lt;strong&gt;(Pulling Unmaintained Bitnami Images)&lt;/strong&gt; - Detects workloads pulling from unmaintained legacy repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PREQUEL-2025-0104&lt;/strong&gt; &lt;strong&gt;(Pulling Latest-Only-Non-Prod Images)&lt;/strong&gt; - Detects workloads pulling images from the latest-only non-prod repo&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PREQUEL-2025-0105&lt;/strong&gt; &lt;strong&gt;(Deployment Tied to Deprecated Bitnami Images)&lt;/strong&gt; Finds deployments that reference deprecated image locations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These CREs are cluster- and pipeline-friendly: run them pre-deployment (CI), in staging, and periodically in prod to address issues and ensure regressions don't occur.&lt;/p&gt;

&lt;h4&gt;
  
  
  Using Prequel to catch Bitnami risks before they break prod
&lt;/h4&gt;

&lt;p&gt;Prequel is the enterprise reliability problem detection platform (from the team behind the open source Preq and CRE projects). It runs CREs continuously, examining and correlating cluster events/logs/configs, and providing guided fixes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Prequel (vs. doing this by hand)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Larger exclusive CRE library covering 100s of popular technologies&lt;/strong&gt; maintained by the &lt;strong&gt;Prequel Reliability Research Team (PRRT)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed detection engine&lt;/strong&gt; that connects the dots across nodes and clusters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web UI&lt;/strong&gt; with guided workflows for investigation &amp;amp; collaboration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep integrations&lt;/strong&gt; (incident tracking, chat, CI/CD).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control plane&lt;/strong&gt; to manage rules, sensors, and rollouts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can use Prequel to continuously scan for these and other risks. &lt;a href="https://www.prequel.dev/sign-up" rel="noopener noreferrer"&gt;Sign up for a 30-day free trial. No credit card required&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faiew9nl6yw9c0xz20g40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faiew9nl6yw9c0xz20g40.png" alt="Commercial CREs" width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sneak Peek of &lt;a href="https://docs.prequel.dev/cres/commercial" rel="noopener noreferrer"&gt;Prequel Rules&lt;/a&gt; Catalog&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Pragmatic Bitnami risk migration options
&lt;/h2&gt;

&lt;p&gt;Once you understand your exposure using an automated or manual method, there are a number of steps you can take:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identify new registries&lt;/strong&gt; - Evaluate alternatives such as Docker Official Images, Docker Hardened Images, or Chainguard to see what meets your needs and budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mirror first, then refactor&lt;/strong&gt; - Point Bitnami images at a private registry mirror for faster pulls and no cutoff risk, then replace images/charts on your own schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin by digest&lt;/strong&gt; - Use immutable digests to lock the exact image you want, unlike tags which may move/disappear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate gates&lt;/strong&gt; - Fail builds when CREs detect deprecated Bitnami pulls in manifests or pipelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prove in staging&lt;/strong&gt; - Force a rolling restart before a cutoff window; verify image pulls and readiness gates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the new defaults&lt;/strong&gt; - Put the new repo/tag/digest and patch cadence where your team can't miss it.&lt;/li&gt;
&lt;/ul&gt;
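&lt;p&gt;For the pin-by-digest step, the change in a values file or manifest looks roughly like this (a sketch: the registry path and digest are placeholders; resolve the real digest from your mirror before committing):&lt;/p&gt;

```yaml
# Before: mutable tag in a repo that is being deleted
image: docker.io/bitnami/postgresql:16.1.0

# After: immutable digest, served from your own mirror
# (registry path and digest are placeholders)
image: registry.example.com/bitnami-mirror/postgresql@sha256:0000000000000000000000000000000000000000000000000000000000000000
```

A digest reference can never silently change or disappear the way a tag can, which is exactly the failure mode the cutoff creates.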

&lt;h2&gt;
  
  
  Wrap-up
&lt;/h2&gt;

&lt;p&gt;Ecosystem shifts like this can break prod today, or break on your next upgrade. It is increasingly difficult to keep up with all the risks that affect your stack. If you need help, let Prequel keep watch for these and 100s of other daily risks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.prequel.dev/sign-up" rel="noopener noreferrer"&gt;Try Prequel&lt;/a&gt;&lt;/strong&gt; and stay ahead of breaking ecosystem changes. &lt;/p&gt;

</description>
      <category>bitnami</category>
      <category>kubernetes</category>
      <category>docker</category>
      <category>devops</category>
    </item>
    <item>
      <title>10 kubectl Plugins That Help Make You the Most Valuable Kubernetes Engineer in the Room</title>
      <dc:creator>Lyndon Brown</dc:creator>
      <pubDate>Thu, 29 May 2025 14:05:00 +0000</pubDate>
      <link>https://dev.to/lyndon_brown_/10-kubectl-plugins-that-help-make-you-the-most-valuable-kubernetes-engineer-in-the-room-263p</link>
      <guid>https://dev.to/lyndon_brown_/10-kubectl-plugins-that-help-make-you-the-most-valuable-kubernetes-engineer-in-the-room-263p</guid>
      <description>&lt;p&gt;Kubernetes is insanely powerful and becomes much easier to manage when you extend kubectl with plugins. Thanks to the open-source community (and the Krew plugin manager), you can add tons of new subcommands to kubectl that streamline tasks and make cluster management easier.  &lt;/p&gt;

&lt;p&gt;But with hundreds of available plugins, how do you decide which to try? &lt;/p&gt;

&lt;p&gt;We're sharing our favorites.  All the plugins that made our list are actively maintained and compatible with recent Kubernetes versions (think 1.30+).&lt;/p&gt;

&lt;p&gt;Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Preq – Detect Reliability Issues Early
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;preq&lt;/code&gt; (pronounced “preek”) is a &lt;strong&gt;reliability problem detector&lt;/strong&gt; that looks for common problems in your application &lt;strong&gt;before&lt;/strong&gt; they cause outages. It is powered by a community-driven catalog of failure patterns (sort of like a vulnerability database, but for reliability). &lt;/p&gt;

&lt;p&gt;With &lt;code&gt;preq&lt;/code&gt;, you can run checks against your running application cluster and get alerted to bugs, misconfigurations, or anti-patterns that others have already identified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;If you’ve ever been blindsided by a production incident, &lt;code&gt;preq&lt;/code&gt; can be a lifesaver. It hunts through events and configurations looking for sequences that match known failure patterns.&lt;/p&gt;

&lt;p&gt;When it finds something, it provides a detailed report (with a recommended fix) so you can act fast. In short, it brings SRE expertise to your fingertips, helping teams &lt;strong&gt;pinpoint and mitigate problems&lt;/strong&gt; before they escalate. It’s like having someone checking your cluster 24/7 (for free and open source!).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is an exciting new project, you can find and ⭐ star the repo here: &lt;a href="https://github.com/prequel-dev/preq" rel="noopener noreferrer"&gt;https://github.com/prequel-dev/preq&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install preq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Pro-tip: &lt;a href="https://krew.sigs.k8s.io/docs/user-guide/setup/install/" rel="noopener noreferrer"&gt;Install&lt;/a&gt; the Krew plugin manager first &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Once installed, you can run &lt;code&gt;preq&lt;/code&gt; on a specific workload or pod. For instance, to run reliability checks on a PostgreSQL pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl preq pg17-postgresql-0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will scan the pod’s logs and events against the library of Common Reliability Enumeration (CRE) rules. If any known issues are detected, you’ll get an output detailing the problem and how to fix it (with references to documentation).&lt;/p&gt;

&lt;p&gt;You can even schedule &lt;code&gt;preq&lt;/code&gt; as a CronJob in your cluster to &lt;strong&gt;continuously monitor and push alerts (e.g. to Slack) when something’s amiss&lt;/strong&gt;. In short, &lt;code&gt;preq&lt;/code&gt; gives you proactive reliability insights that help &lt;strong&gt;stop outages&lt;/strong&gt; before they happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Neat – Clean Up Verbose YAML Output
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl-neat&lt;/code&gt; does exactly what it sounds like – it &lt;strong&gt;neatens up&lt;/strong&gt; Kubernetes YAML output by removing all the clutter. When you run &lt;code&gt;kubectl get ... -o yaml&lt;/code&gt;, the output is often filled with extra fields (&lt;code&gt;status&lt;/code&gt;, &lt;code&gt;managedFields&lt;/code&gt;, &lt;code&gt;selfLink&lt;/code&gt;, etc.) that make it hard to focus on the spec. &lt;code&gt;neat&lt;/code&gt; strips out those noisy fields and default values, leaving you with a clean manifest that’s much easier to read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;If you find your eyes glazing over from endless autogenerated metadata, this plugin is for you. It omits fields like managedFields, status, creationTimestamp, resourceVersion, and other boilerplate that Kubernetes injects. The result is a tidy view of the resource’s actual configuration.&lt;/p&gt;

&lt;p&gt;This is super helpful for troubleshooting or comparing manifests – you can see &lt;em&gt;just&lt;/em&gt; the fields you or your tools defined, without the Kubernetes-added noise. In short, neat makes YAML outputs &lt;strong&gt;concise&lt;br&gt;
and readable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install neat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Simply pipe any verbose output into kubectl neat. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get podnginx-abc123 -o yaml | kubectl neat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will output the YAML for pod nginx-abc123 but without all the junk (owner references, timestamps, default values, etc.). &lt;/p&gt;

&lt;p&gt;You can then easily diff or inspect this trimmed manifest. It’s a huge timesaver when you want to quickly understand a resource’s config without wading through Kubernetes-added fields.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. View-Secret – Decode Secrets on the Fly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;No more manual Base64 decoding! The kubectl-view-secret plugin makes it effortless to &lt;strong&gt;view Kubernetes Secrets in plain text&lt;/strong&gt;. Normally, if you do &lt;code&gt;kubectl get secret my-secret -o yaml&lt;/code&gt;, you'll see Base64 encoded content for each key. With view-secret, those values are decoded to human-readable strings automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Ever needed to quickly check what value was stored in a Secret? This plugin saves you &lt;em&gt;tons&lt;/em&gt; of time. Instead of copying the encoded string and running &lt;code&gt;echo ... | base64 -d&lt;/code&gt; for each key, you just run one command and see actual secret values. This is especially handy for Secrets with multiple keys, like TLS certs or app configs. It's also great for verifying that your Secret data is correct (in dev/test environments) without the hassle of decoding. Essentially, view-secret &lt;strong&gt;eliminates the manual steps&lt;/strong&gt; when managing Secrets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;view-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;To view a secret named &lt;code&gt;my-secret&lt;/code&gt; in the default namespace, just run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl view-secret my-secret &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding the &lt;code&gt;--all&lt;/code&gt; flag shows all key-value pairs. For example, you might get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;key1=supersecret
key2=topsecret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;as the output. No more copy-pasting and decoding – you immediately see that key1 is "supersecret" and key2 is "topsecret". In one quick command, you have your secret values at hand &lt;strong&gt;(be careful where you run this though, as it will expose sensitive info in your terminal)&lt;/strong&gt;.&lt;/p&gt;
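&lt;p&gt;For comparison, this is the manual round-trip that &lt;code&gt;view-secret&lt;/code&gt; saves you from repeating for every key (using the same example value):&lt;/p&gt;

```shell
# Secrets store values Base64-encoded; decoding one key by hand:
echo 'c3VwZXJzZWNyZXQ=' | base64 -d
# prints: supersecret
```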

&lt;h2&gt;
  
  
  4. Tree – Visualize Resource Ownership
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The kubectl-tree plugin displays &lt;strong&gt;ownership hierarchy&lt;/strong&gt; in your cluster in a nice tree format. Kubernetes objects often own or control other objects (e.g., a Deployment owns ReplicaSets which own Pods). tree lets you pick a top-level resource and see all its descendants laid out as an ASCII tree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;This is fantastic for understanding &lt;strong&gt;which resources are linked together&lt;/strong&gt;. For example, if you have a complex app, you can run kubectl tree on a Deployment or on a CustomResource and instantly see the chain of owned objects beneath it. It's especially useful with CRDs, where the relationships might not be obvious. Instead of manually cross-referencing owners, you get a clear picture: e.g., a StatefulSet -&amp;gt; Pods -&amp;gt; PVCs, etc. This helps in cleanup (to ensure you delete dependents) and in troubleshooting cascading issues. In short, tree gives you a birds-eye view of how Kubernetes controllers have orchestrated your resources &lt;strong&gt;in a hierarchy&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;tree
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;To see what a particular resource owns, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl tree deploy my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This might output a tree of all objects created by the Deployment my-app, such as ReplicaSets and Pods. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Deployment/my-app
└─ ReplicaSet/my-app-5fd76f7d5c
   ├─ Pod/my-app-5fd76f7d5c-abcde
   └─ Pod/my-app-5fd76f7d5c-fghij
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you the Deployment owns a ReplicaSet, which in turn has two Pods. You can use kubectl tree on other high-level resources too (like an Ingress or a CRD instance) to reveal what's underneath. It's a superb way to &lt;strong&gt;navigate complex deployments&lt;/strong&gt; and ensure you understand resource dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Tail – Stream Logs from Multiple Pods
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;kubectl-tail (the Krew plugin name is just &lt;strong&gt;tail&lt;/strong&gt;) is a handy plugin for &lt;strong&gt;tailing logs from multiple pods&lt;/strong&gt; in real time. It's like a supercharged version of &lt;code&gt;kubectl logs -f&lt;/code&gt;, allowing you to aggregate logs across several pods or even an entire label/selector. Under the hood, this plugin is based on Kail, providing Stern-like functionality directly as a kubectl plugin.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;When debugging an app that's distributed across many pods (say a deployment with replicas or a microservice with multiple components), it's inconvenient to open separate log streams for each pod. kubectl tail solves this by letting you &lt;strong&gt;target multiple pods at once&lt;/strong&gt; – for example, by deployment name, service name, or label selector. The logs from all matching pods are merged and streamed to your terminal. You can even filter by timeframe (--since) or specific pods. As Alex Moss noted, one great feature is targeting a higher-level resource: e.g. &lt;code&gt;kubectl tail --svc=my-service&lt;/code&gt; to see logs from all pods behind that Service. This plugin simplifies multi-pod debugging and gives you a consolidated, live view of what's happening across your application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install tail&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Here are a few common ways to use kubectl tail:&lt;/p&gt;

&lt;p&gt;By namespace: View logs from all pods in a namespace (e.g. default):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;--ns&lt;/span&gt; default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By label selector: Stream logs from pods matching a label, e.g. all pods with app=web:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;web &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(This would show the last 10 minutes of logs from every pod labeled app=web, and then continue streaming.)&lt;/p&gt;

&lt;p&gt;Multiple specific pods: If you want to tail two particular pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;--pod&lt;/span&gt; web-abcde &lt;span class="nt"&gt;--pod&lt;/span&gt; web-fghij &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In each case, logs from all targeted pods will stream live to your terminal. The plugin even color-codes logs by pod, making it easier to distinguish sources. It's a simple but powerful way to &lt;strong&gt;debug issues that span many pods&lt;/strong&gt; without juggling multiple kubectl logs commands.&lt;/p&gt;
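&lt;p&gt;The service-targeting form mentioned earlier follows the same pattern; assuming a Service named &lt;code&gt;my-service&lt;/code&gt; (a hypothetical name), it would look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl tail --svc=my-service --since=5m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This merges logs from every pod currently selected by the Service, so pods that come and go behind it are handled for you rather than tracked by name.&lt;/p&gt;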

&lt;h2&gt;
  
  
  6. Who-Can – Investigate RBAC Permissions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;kubectl-who-can helps you answer the question: &lt;em&gt;"Who can do X in my cluster?"&lt;/em&gt; It's an RBAC &lt;strong&gt;permissions investigator&lt;/strong&gt;. You give it an action (verb) and resource, and it tells you which users, groups, or service accounts are allowed to perform that action. Essentially, it's &lt;code&gt;kubectl auth can-i&lt;/code&gt; in reverse: instead of asking whether &lt;em&gt;you&lt;/em&gt; can do something, you ask &lt;em&gt;who&lt;/em&gt; can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Kubernetes RBAC can get complicated. If you're debugging a "permission denied" error or just auditing access, this plugin is gold. Instead of manually inspecting ClusterRoleBindings, you can simply ask "who can delete pods in this namespace?" and get an immediate answer. It's particularly useful for &lt;strong&gt;debugging RBAC issues&lt;/strong&gt; and ensuring your policies are set correctly. For example, if a deployment failed because it couldn't list Secrets, who-can will show you which account needs a role update. In short, it provides quick visibility into &lt;strong&gt;who has access to what&lt;/strong&gt;, saving you from hunting through YAML and docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;who-can
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Say you want to find out who can delete pods in namespace foo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl who-can delete pods &lt;span class="nt"&gt;--namespace&lt;/span&gt; foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
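&lt;p&gt;An illustrative (not literal) result, with hypothetical binding and account names, might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROLEBINDING         NAMESPACE  SUBJECT     TYPE            SA-NAMESPACE
deploy-bot-binding  foo        deploy-bot  ServiceAccount  foo

CLUSTERROLEBINDING  SUBJECT  TYPE   SA-NAMESPACE
cluster-admin       admins   Group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;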



&lt;p&gt;The plugin will return a list of subjects (users, groups, and service accounts) that have that capability, along with the RoleBinding or ClusterRoleBinding granting it. For instance, you might see that a RoleBinding gives the "deploy-bot" service account delete permission on pods, and that cluster admins can delete them too. You can also run broader queries, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl who-can get secret/db-password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to see who can read the db-password secret. This is incredibly useful for security audits—quickly verify that only the intended identities have access. In summary, who-can turns RBAC from a mystery into an answerable question, helping you &lt;strong&gt;secure and troubleshoot your cluster's access control&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. kubectx – Swiftly Switch Contexts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;kubectx is a popular command-line tool (and kubectl plugin) for &lt;strong&gt;fast context switching&lt;/strong&gt;. In Kubernetes, a "context" is essentially which cluster + user you're currently using. If you work with multiple clusters (dev, staging, prod, or multiple cloud providers), kubectx lets you flip between them with a single short command, instead of typing &lt;code&gt;kubectl config use-context ...&lt;/code&gt; each time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Working in the wrong cluster can be disastrous ("oops, I just ran that in prod!"). kubectx makes it trivial to see your available contexts and switch in a heartbeat. It increases productivity for those managing multiple clusters by removing friction from context changes. It also supports tab-completion, so you can quickly auto-complete context names. Both beginners and pros love this because it simplifies multi-cluster workflows—keeping you from accidentally operating in the wrong environment and &lt;strong&gt;saving you time&lt;/strong&gt; every day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;You can install via Krew (there are plugins named &lt;strong&gt;ctx&lt;/strong&gt; and &lt;strong&gt;ns&lt;/strong&gt; for kubectx/kubens). To use Krew, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;ctx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;After installation, switching contexts is as easy as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectx prod-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will switch your current kubectl context to "prod-cluster" (whatever name you have in your kubeconfig). To list all contexts, just run kubectx with no arguments and it will show an interactive list. You can also &lt;strong&gt;shorten context names&lt;/strong&gt; or set up aliases for convenience. With this, managing multiple clusters (like bouncing between dev → staging → prod) becomes a breeze. Pair it with kubens (below) for full power.&lt;/p&gt;
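&lt;p&gt;The name-shortening mentioned above uses kubectx's rename syntax, and &lt;code&gt;kubectx -&lt;/code&gt; toggles back to the previous context (the long context name below is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectx prod=gke_my-project_us-east1_prod-cluster  # rename the long context to "prod"
kubectx -                                          # jump back to the previous context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;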

&lt;h2&gt;
  
  
  8. kubens – Speedy Namespace Switching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;A perfect companion to kubectx, kubens lets you &lt;strong&gt;quickly switch between Kubernetes namespaces&lt;/strong&gt;. Rather than typing &lt;code&gt;-n &amp;lt;namespace&amp;gt;&lt;/code&gt; every time or editing your context, you just run &lt;code&gt;kubens &amp;lt;name&amp;gt;&lt;/code&gt; and it changes your current namespace in the kubeconfig context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Kubernetes namespaces help divide resources, but it's tedious to constantly specify -n or modify context YAML by hand. kubens streamlines this by making namespace changes a single short command. This is great for both beginners (who might forget to target the right namespace) and advanced users managing multi-tenant clusters. It prevents mistakes like deploying to the default namespace unintentionally. Combined with kubectx, you can navigate &lt;strong&gt;clusters and namespaces&lt;/strong&gt; with ease. It's all about efficiency: less typing, less context switching in your head, and more focus on what you're deploying or debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;ns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;To switch to the kube-system namespace (in your current context):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubens kube-system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your default namespace for kubectl commands is now kube-system until you switch again. Running kubens with no arguments lists all namespaces in the current context, so you can pick one interactively. For instance, if you're juggling multiple projects in a cluster, kubens lets you jump between dev, test, and prod namespaces instantly. No more forgetting to add -n and wondering "why can't it find my pods?"—this tool keeps your namespace context correct at all times.&lt;/p&gt;
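&lt;p&gt;Like kubectx, kubens can also bounce you back to wherever you just were:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubens -  # switch back to the previous namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;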

&lt;h2&gt;
  
  
  9. Kube-Score – Lint Your Kubernetes Manifests
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; kube-score (available via the Krew plugin &lt;strong&gt;score&lt;/strong&gt;) is a &lt;strong&gt;static code analysis tool&lt;/strong&gt; for Kubernetes YAML resources. In simpler terms, it's like a linter or "config validator" for your deployment files. You run it on your manifests (deployments, services, etc.), and it scores them and gives recommendations for improvements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Kubernetes will happily accept configs that are syntactically valid but follow bad practices. kube-score flags those issues &lt;em&gt;before&lt;/em&gt; you apply them. It checks for things like missing resource limits, improper health checks, deprecated API versions, and many other best-practice violations. The output is a list of suggestions on what to improve for better reliability and security. This is extremely useful for validating configuration files – you catch mistakes or omissions early, in CI/CD or during development. Essentially, kube-score &lt;strong&gt;acts as a quality gate&lt;/strong&gt; for your Kubernetes manifests, ensuring they adhere to recommended standards (so you don't get surprises at runtime).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(This installs the kubectl score plugin, which internally uses the kube-score tool.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;To analyze a file (or directory of YAMLs) for issues, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl score my-app.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plugin will output a report, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[WARNING] Deployment my-app: Container without resource limits
↳ It's recommended to set resource limits for containers to avoid resource hogging.

[CRITICAL] Service my-app-service: Uses targetPort name that doesn't match any container port
↳ The targetPort "http" is not found in any container of the associated pods.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each finding comes with a severity and an explanation. You'd then go back and fix those in your YAML. You can also pass several files at once (for example &lt;code&gt;kubectl score deploy/*.yaml&lt;/code&gt;) to scan multiple manifests in one run. This plugin is perfect for &lt;strong&gt;validating Kubernetes config files&lt;/strong&gt; as part of code reviews or CI pipelines. It helps both newbies and experts by pointing out potential misconfigurations (before they hit your cluster).&lt;/p&gt;
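&lt;p&gt;Because kube-score exits non-zero when it finds problems, you can use it as a quality gate in CI; a minimal sketch (the path is illustrative, and this assumes the plugin propagates kube-score's exit code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# fail the pipeline stage if any manifest has issues
kubectl score manifests/*.yaml || { echo "kube-score found issues" &gt;&amp;amp;2; exit 1; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;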

&lt;h2&gt;
  
  
  10. Sniff – Capture Pod Traffic Like a Pro
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is it:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Ever wished you could tcpdump inside a Kubernetes pod? kubectl sniff (a.k.a. &lt;strong&gt;ksniff&lt;/strong&gt;) is a plugin that makes that possible. It uses tcpdump and Wireshark under the hood to &lt;strong&gt;capture network traffic from any pod&lt;/strong&gt; in your cluster. With one command, it starts a remote packet capture on the target pod and streams the data for you to open in Wireshark or analyze with other tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it's useful:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Debugging network issues in Kubernetes can be tricky – you often need to see the raw traffic. sniff automates the heavy lifting of deploying a capture container alongside your pod and piping the output to your workstation. You get the full power of Wireshark for inspecting packets, with minimal impact on the running pod. This is incredibly useful for advanced troubleshooting: e.g., investigating why a service isn't responding, checking if a pod is actually making calls to an external API, or diagnosing weird networking behavior. Instead of crafting tcpdump commands on a node or modifying the pod, you run one plugin and get a pcap of what's happening. It's a game-changer for &lt;strong&gt;network debugging&lt;/strong&gt; in Kubernetes clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Installation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl krew &lt;span class="nb"&gt;install &lt;/span&gt;sniff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Note: You'll need Wireshark installed locally for live capture, or you can output to a pcap file and open it later. Also, the target pod's node needs to allow running the capture – the plugin can handle many scenarios including non-privileged containers by using a helper Pod.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example usage:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;To start capturing traffic from a pod my-pod in namespace default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl sniff my-pod &lt;span class="nt"&gt;-n&lt;/span&gt; default &lt;span class="nt"&gt;-c&lt;/span&gt; main-container
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here -c specifies the container (if omitted, it defaults to the first container in the pod). By default, sniff will launch Wireshark on your machine showing the live traffic from my-pod. You can apply capture filters with -f, for example -f "port 80" to only capture web traffic. If you prefer to save to file instead of live view, use the -o flag to write a pcap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl sniff my-pod &lt;span class="nt"&gt;-n&lt;/span&gt; default &lt;span class="nt"&gt;-o&lt;/span&gt; output.pcap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this, you'll have output.pcap with all packets captured from my-pod's network interface. Open that in Wireshark and you can dissect the traffic at your leisure. kubectl sniff brings deep network insight to Kubernetes – previously, you might have had to exec into the node or use complex setups, but now it's one simple command. It's an &lt;strong&gt;advanced tool&lt;/strong&gt; that becomes surprisingly approachable thanks to this plugin.&lt;/p&gt;
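&lt;p&gt;For the non-privileged scenario noted in the installation caveat (for example, a minimal container image without tcpdump), ksniff's &lt;code&gt;-p&lt;/code&gt; flag runs the capture from a privileged pod on the node instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl sniff my-pod -n default -p -o output.pcap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;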

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;These kubectl plugins can dramatically enhance your productivity and capabilities with Kubernetes. From validating your configs to debugging live clusters, they fill in gaps that the default kubectl doesn't cover. Best of all, they integrate seamlessly – you invoke them as if they were native kubectl commands. Go ahead and try installing a few that pique your interest (via Krew), and you'll wonder how you managed Kubernetes without them!&lt;/p&gt;

&lt;p&gt;Enjoy! (And let us know in the comments which kubectl plugin is your favorite, or if we missed one that you think should have been in our top 10.)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>sre</category>
      <category>devops</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
