Prithiviraj R

Posted on Nov 21

Kubernetes Troubleshooting with K8sGPT + Amazon Bedrock

#ai #aws #kubernetes #devops

Introduction
Troubleshooting Kubernetes issues often requires switching between logs, events, and documentation. K8sGPT simplifies this by using AI to analyze cluster problems and generate human-readable explanations and solutions.
In this summary, I’ll walk through how I used K8sGPT with Amazon Bedrock to diagnose issues inside an EKS cluster and how it improved the entire troubleshooting experience.

What is K8sGPT?
K8sGPT is an open-source CLI and operator that scans Kubernetes resources and uses AI models to:

Detect misconfigurations
Explain errors in simple language
Provide actionable fixes
Recommend best practices

Why Integrate with Amazon Bedrock?
Amazon Bedrock enhances K8sGPT by offering:

Multiple LLM choices (Nova, Claude, Llama)
Secure, enterprise-ready identity through AWS IAM (no separate API keys to manage)
Low-latency inference from AWS regions
Cost savings using lightweight models like Nova Lite

Setup Overview
For the demo, I used:

EKS cluster
Broken and misconfigured workloads
K8sGPT with Amazon Bedrock Nova Lite model

1. Create the EKS Cluster

I used a simple eksctl configuration to provision a small EKS cluster.
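A minimal sketch of what such an eksctl configuration could look like, written to a file from the shell. The cluster name, node group name, instance type, and node count are assumptions for illustration, not the values from the original demo; the region is assumed to match the Bedrock region used later.

```shell
# Write a small eksctl cluster definition (all values are illustrative assumptions).
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: k8sgpt-demo        # hypothetical cluster name
  region: us-east-1        # assumed to match the Bedrock region
managedNodeGroups:
  - name: demo-nodes       # hypothetical node group name
    instanceType: t3.medium
    desiredCapacity: 2
EOF

# Provision the cluster from the file (requires eksctl and AWS credentials):
#   eksctl create cluster -f cluster.yaml
```

Keeping the definition in a file makes the demo cluster easy to recreate and to delete afterwards with eksctl delete cluster -f cluster.yaml.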

2. Install K8sGPT

Download the K8sGPT CLI release for your platform, extract the archive, and move the binary into a directory on your PATH such as /usr/local/bin/.
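The extract-and-move part of the installation can be sketched as a small shell function. It assumes the release archive has already been downloaded from the project's releases page and that it contains an executable named k8sgpt at its top level; the function name install_k8sgpt is hypothetical.

```shell
# install_k8sgpt TARBALL [DEST_DIR]
# Extracts a previously downloaded K8sGPT release archive and installs the binary.
# Assumption: the archive contains an executable named "k8sgpt" at its top level.
install_k8sgpt() {
  tarball="$1"
  dest_dir="${2:-/usr/local/bin}"
  workdir="$(mktemp -d)" || return 1

  # Unpack into a scratch directory so only the binary ends up on the PATH.
  tar -xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }

  # Copy the binary into place with executable permissions.
  install -m 0755 "$workdir/k8sgpt" "$dest_dir/k8sgpt"
  status=$?

  rm -rf "$workdir"
  return "$status"
}
```

Writing to /usr/local/bin usually needs root, so either run the function from a root shell or pass a user-writable directory that is on your PATH as the second argument. Running k8sgpt version afterwards confirms the binary is found.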

3. Configure K8sGPT to Use Bedrock

K8sGPT supports Bedrock natively.
You configure it by adding:

Backend provider → amazonbedrock
Bedrock region → us-east-1
Model → amazon.nova-lite-v1:0

Then set Bedrock as the default provider.
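The configuration above might translate into the following commands, wrapped in a hypothetical helper named configure_bedrock. The flag names reflect the k8sgpt auth subcommands as I understand them; check k8sgpt auth add --help on your version before relying on them. AWS credentials are assumed to come from the standard credential chain (environment variables, a profile, or an attached role).

```shell
# Register Amazon Bedrock as a K8sGPT backend and make it the default provider.
# Defaults mirror the values in this post; pass other values as arguments.
configure_bedrock() {
  region="${1:-us-east-1}"
  model="${2:-amazon.nova-lite-v1:0}"

  # Add the backend with its model and region.
  k8sgpt auth add \
    --backend amazonbedrock \
    --model "$model" \
    --providerRegion "$region" || return 1

  # Make Bedrock the provider used by `k8sgpt analyze --explain`.
  k8sgpt auth default -p amazonbedrock
}
```

Call configure_bedrock with no arguments to use the defaults, then run k8sgpt auth list to confirm that amazonbedrock is active.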

4. Create Problematic Kubernetes Resources

To test K8sGPT, I intentionally deployed workloads with issues:

Examples of injected failures

Broken Image Pod → invalid image registry
Resource Heavy Pod → impossible CPU/memory requests
StatefulSet with Wrong StorageClass → PVC creation failure

These represent issues commonly faced in real Kubernetes environments.
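As a rough illustration, the three failures could be reproduced with manifests like the ones below. Resource names, the registry host, the request sizes, and the StorageClass name are all made up for this sketch; only the k8sgpt-demo namespace comes from the commands in this post.

```shell
# Write intentionally broken workloads to a file (names and values are hypothetical).
cat > broken-workloads.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: k8sgpt-demo
---
# Broken Image Pod: the registry does not exist, so the pull fails.
apiVersion: v1
kind: Pod
metadata:
  name: broken-image-pod
  namespace: k8sgpt-demo
spec:
  containers:
    - name: app
      image: invalid-registry.example.com/app:latest
---
# Resource Heavy Pod: requests far beyond what a small node group offers.
apiVersion: v1
kind: Pod
metadata:
  name: resource-heavy-pod
  namespace: k8sgpt-demo
spec:
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "100"
          memory: 500Gi
---
# StatefulSet whose volume claim references a StorageClass that does not exist.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: wrong-storage-sts
  namespace: k8sgpt-demo
spec:
  serviceName: wrong-storage-sts
  replicas: 1
  selector:
    matchLabels:
      app: wrong-storage-sts
  template:
    metadata:
      labels:
        app: wrong-storage-sts
    spec:
      containers:
        - name: app
          image: nginx:1.27
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: does-not-exist
        resources:
          requests:
            storage: 1Gi
EOF

# Deploy the broken workloads (requires kubectl access to the cluster):
#   kubectl apply -f broken-workloads.yaml
```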

5. Run K8sGPT Analysis

Once workloads are deployed:
k8sgpt analyze

k8sgpt analyze --filter Pod --namespace k8sgpt-demo --explain | head -20

K8sGPT quickly identifies issues and, when run with --explain, generates suggested fixes using the configured AI backend.

Example Output (Summarized)

Problem	Issue	Fix
Image Pull Error	Back-off pulling image	Correct the registry/image tag
Insufficient Resources	Pods cannot schedule	Adjust CPU/memory or scale the nodegroup
Invalid StorageClass	PVCs stuck in Pending	Update the storage class or create a valid one

These explanations are far more readable than raw Kubernetes error messages.

Essential Commands

1. Basic Analysis (No AI)

k8sgpt analyze

2. AI-Powered Analysis (Uses Bedrock)
k8sgpt analyze --explain

3. Specific Namespace

k8sgpt analyze --explain --namespace k8sgpt-demo

4. Specific Resource Type

k8sgpt analyze --explain --filter Pod

k8sgpt analyze --explain --filter Deployment

5. Multiple Filters

k8sgpt analyze --explain --filter Pod,Deployment,Service

6. Check Configuration

k8sgpt auth list

k8sgpt filters list

7. Verbose Output

k8sgpt analyze --explain --verbose

Benefits Observed

1. Intelligent Issue Detection

K8sGPT identified issues across pods, StatefulSets, services, and jobs.

2. Human-Friendly Explanations

It rewrites cryptic Kubernetes errors into simple sentences.

3. Actionable Remediation Steps

Includes:

kubectl commands
Configuration fixes
Architecture recommendations

4. Speed

It analyzed more than a dozen issues in seconds.

Amazon Bedrock Model Options

Model	Speed	Quality	Cost	Best Use Case
Nova Lite	Fastest	Great	Cheapest	Day-to-day troubleshooting
Nova Pro	High	Excellent	Moderate	Complex multi-resource analysis
Claude 3 Haiku	Fast	High	Good	Detailed explanations

Recommendation: Start with Nova Lite.

Fixing Issues (Summary)

Fix Image Pull Failure

Delete the broken pod and recreate it with a valid image.

Fix Misconfigured Deployment

Patch the Deployment or update the Helm chart values.

Fix Storage Issues

Correct the StorageClass or create a valid one.
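The three fixes might look like the helpers below. This is a sketch under assumptions: every resource name is hypothetical, the Deployment problem is assumed to be oversized resource requests, and the replacement image is only an example. Substitute the names that k8sgpt analyze reports in your cluster.

```shell
NS="k8sgpt-demo"

# Image pull failure: delete the broken pod and recreate it with a valid image.
fix_image_pull() {
  kubectl delete pod broken-image-pod -n "$NS" --ignore-not-found
  kubectl run broken-image-pod -n "$NS" --image=nginx:1.27
}

# Misconfigured Deployment: patch the requests down to something schedulable.
fix_deployment() {
  kubectl set resources deployment/demo-app -n "$NS" \
    --requests=cpu=100m,memory=128Mi
}

# Storage issue: volumeClaimTemplates cannot be edited in place, so remove the
# StatefulSet and its Pending claim, then re-apply a manifest that references
# a valid StorageClass (list candidates with `kubectl get storageclass`).
fix_storage() {
  sts="$1"              # StatefulSet name
  claim="$2"            # volumeClaimTemplate name
  fixed_manifest="$3"   # corrected manifest file
  kubectl delete statefulset "$sts" -n "$NS"
  kubectl delete pvc "${claim}-${sts}-0" -n "$NS" --ignore-not-found
  kubectl apply -f "$fixed_manifest"
}
```

For example, fix_storage wrong-storage-sts data fixed-statefulset.yaml removes the claim named data-wrong-storage-sts-0, following the StatefulSet convention of naming claims after the template, the set, and the pod ordinal.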

Production Considerations

Security

Use IAM Roles
Avoid embedding credentials
Enable CloudTrail auditing
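For the IAM role, a least-privilege policy might look like the sketch below. The exact set of Bedrock actions K8sGPT needs is an assumption here, and the policy name is hypothetical; the model ARN assumes the us-east-1 region and the Nova Lite model used in this post.

```shell
# Write an IAM policy that only allows invoking the Nova Lite model.
cat > k8sgpt-bedrock-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeNovaLite",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0"
    }
  ]
}
EOF

# Create the policy (requires the AWS CLI and IAM permissions):
#   aws iam create-policy --policy-name k8sgpt-bedrock-invoke \
#     --policy-document file://k8sgpt-bedrock-policy.json
```

Attaching this policy to a role, rather than to a user with static keys, keeps credentials out of the K8sGPT configuration.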

Cost

Monitor Bedrock usage
Use model tiering based on workload

Integrations

Prometheus + Grafana
Alerting systems
CI/CD pre-deployment checks
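A CI/CD pre-deployment check could be as small as the gate sketched below. It assumes that the JSON report from k8sgpt analyze --output json carries a top-level "status" field set to "ProblemDetected" when issues exist; verify that against the output of your K8sGPT version. The function name report_has_problems is hypothetical.

```shell
# report_has_problems: reads a K8sGPT JSON report on stdin.
# Returns 0 if the report signals problems, non-zero otherwise.
# Assumption: problems are signalled by "status": "ProblemDetected".
report_has_problems() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ProblemDetected"'
}

# Example pipeline step (requires k8sgpt and cluster access):
#   if k8sgpt analyze --namespace k8sgpt-demo --output json | report_has_problems; then
#     echo "K8sGPT found problems; blocking deployment" >&2
#     exit 1
#   fi
```

The gate runs the analysis without --explain, so it needs no Bedrock calls and adds no inference cost to the pipeline.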

In-Cluster Operator Deployment

You can deploy the K8sGPT operator using Helm and set Bedrock as the backend for ongoing cluster analysis.
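A sketch of the operator setup follows. The chart repository URL, chart name, and custom resource fields follow the K8sGPT operator documentation as I recall it, so treat them as assumptions and confirm against the current README. The release name, resource name, and the K8SGPT_VERSION placeholder are hypothetical, and the operator's pods are assumed to receive AWS credentials through an IAM role.

```shell
# Install the operator chart into its own namespace.
install_k8sgpt_operator() {
  helm repo add k8sgpt https://charts.k8sgpt.ai/ || return 1
  helm repo update || return 1
  helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
    -n k8sgpt-operator-system --create-namespace
}

# Write a K8sGPT custom resource that points the operator at Bedrock.
# Set K8SGPT_VERSION to a real K8sGPT release tag before applying.
cat > k8sgpt-bedrock.yaml <<EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    backend: amazonbedrock
    model: amazon.nova-lite-v1:0
    region: us-east-1
  noCache: false
  repository: ghcr.io/k8sgpt-ai/k8sgpt
  version: ${K8SGPT_VERSION:-REPLACE_ME}
EOF

# Apply once the operator is running:
#   kubectl apply -f k8sgpt-bedrock.yaml
```

Run install_k8sgpt_operator first, wait for the operator pod to become ready, then apply the custom resource.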

Before vs After K8sGPT

Before

Manual debugging
Slow MTTR
High cognitive load
Heavy dependency on senior engineers

After

Fast AI-driven diagnostics
Consistent solutions
Junior engineers troubleshoot confidently
Significantly reduced MTTR

Conclusion

K8sGPT combined with Amazon Bedrock is a powerful way to modernize Kubernetes troubleshooting. It minimizes time spent on debugging, improves team efficiency, and brings clarity to complex cluster issues.

If you're managing EKS environments at scale, this integration provides:

Faster resolutions
Clearer insights
Better operational consistency

The future of Kubernetes troubleshooting is AI-driven, and tools like K8sGPT make that future easy to adopt.

Quick Start Checklist

Configure AWS credentials
Install K8sGPT
Add Amazon Bedrock as AI backend
Deploy workloads
Run k8sgpt analyze
Apply the recommended fixes

Happy Learning
Prithiviraj Rengarajan
DevOps Engineer

DEV Community