Prithiviraj R

Kubernetes Troubleshooting with K8sGPT + Amazon Bedrock

Introduction
Troubleshooting Kubernetes issues often requires switching between logs, events, and documentation. K8sGPT simplifies this by using AI to analyze cluster problems and generate human-readable explanations and solutions.
In this summary, I’ll walk through how I used K8sGPT with Amazon Bedrock to diagnose issues inside an EKS cluster and how it improved the entire troubleshooting experience.

What is K8sGPT?
K8sGPT is an open-source CLI and operator that scans Kubernetes resources and uses AI models to:

  • Detect misconfigurations
  • Explain errors in simple language
  • Provide actionable fixes
  • Recommend best practices

Why Integrate with Amazon Bedrock?
Amazon Bedrock enhances K8sGPT by offering:

  • Multiple LLM choices (Nova, Claude, Llama)
  • IAM-based authentication, so there are no separate API keys to manage
  • Low-latency inference from AWS regions
  • Cost savings using lightweight models like Nova Lite

Setup Overview
For the demo, I used:

  • EKS cluster
  • Broken and misconfigured workloads
  • K8sGPT with Amazon Bedrock Nova Lite model

1. Create the EKS Cluster

I used a simple eksctl configuration to provision a small EKS cluster.
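
A configuration along these lines would provision such a cluster. This is a minimal sketch, not the exact file used for the demo: the cluster name, node group name, instance type, and node count are assumptions, and the region is chosen only to match the Bedrock region used later.

```shell
# Write a minimal eksctl ClusterConfig to cluster.yaml.
# Cluster name, node group name, instance type, and capacity are illustrative.
cat > cluster.yaml <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: k8sgpt-demo        # hypothetical cluster name
  region: us-east-1

managedNodeGroups:
  - name: demo-nodes       # hypothetical node group name
    instanceType: t3.medium
    desiredCapacity: 2
EOF

echo "wrote cluster.yaml"
```

Running eksctl create cluster -f cluster.yaml then provisions the cluster, and eksctl delete cluster -f cluster.yaml removes it once the demo is finished.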

2. Install K8sGPT

Download the K8sGPT CLI binary, extract it, and move it into a directory on your PATH, such as /usr/local/bin/.
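
The extract-and-move step can be wrapped in a small helper. This is a sketch under two assumptions: the release archive has already been downloaded from the project's releases page, and it contains a binary named k8sgpt at its top level. The function name install_k8sgpt is made up for this example.

```shell
# install_k8sgpt TARBALL [DEST_DIR]
# Extracts a downloaded K8sGPT release archive and installs the binary.
# Assumption: the archive holds a binary named "k8sgpt" at its top level.
install_k8sgpt() {
  tarball="$1"
  dest="${2:-/usr/local/bin}"
  workdir="$(mktemp -d)" || return 1

  if tar -xzf "$tarball" -C "$workdir" &&
     install -m 0755 "$workdir/k8sgpt" "$dest/k8sgpt"; then
    status=0
  else
    status=1
  fi

  # Clean up the scratch directory whether or not the install succeeded.
  rm -rf "$workdir"
  return "$status"
}
```

For example, install_k8sgpt ./k8sgpt.tar.gz (the archive name here is a placeholder) installs into /usr/local/bin, which may need sudo. Afterwards, k8sgpt version confirms the binary is on the PATH.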

3. Configure K8sGPT to Use Bedrock

K8sGPT supports Bedrock natively.
You configure it by adding:

  • Backend provider → amazonbedrock
  • Bedrock region → us-east-1
  • Model → amazon.nova-lite-v1:0

Then set Bedrock as the default provider.
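
In CLI terms, the configuration above could look like the sketch below. The flag names reflect my understanding of the K8sGPT CLI, so confirm them with k8sgpt auth add --help on your version. It also assumes AWS credentials are already available through the usual AWS credential chain. The wrapper name configure_bedrock is made up for this example.

```shell
# configure_bedrock [REGION] [MODEL]
# Registers Amazon Bedrock as a K8sGPT backend and makes it the default.
# Defaults mirror the region and model listed above.
configure_bedrock() {
  region="${1:-us-east-1}"
  model="${2:-amazon.nova-lite-v1:0}"

  k8sgpt auth add --backend amazonbedrock \
    --providerRegion "$region" \
    --model "$model" &&
  k8sgpt auth default --provider amazonbedrock
}
```

After calling configure_bedrock, k8sgpt auth list should show amazonbedrock as the default provider.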

4. Create Problematic Kubernetes Resources

To test K8sGPT, I intentionally deployed workloads with issues:

Examples of injected failures
  • Broken Image Pod → invalid image registry
  • Resource Heavy Pod → impossible CPU/memory requests
  • StatefulSet with Wrong StorageClass → PVC creation failure

These represent issues commonly faced in real Kubernetes environments.
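
The manifests below are one way to reproduce these three failures. They are a sketch rather than the exact files from the demo: every resource name, the nginx:stable image, the request sizes, and the StorageClass name are assumptions. Only the k8sgpt-demo namespace comes from the commands in this post.

```shell
# Write three deliberately broken workloads to broken-workloads.yaml.
cat > broken-workloads.yaml <<'EOF'
apiVersion: v1
kind: Namespace
metadata:
  name: k8sgpt-demo
---
apiVersion: v1
kind: Pod
metadata:
  name: broken-image-pod
  namespace: k8sgpt-demo
spec:
  containers:
    - name: app
      # The .invalid TLD never resolves, so the image pull always fails.
      image: registry.invalid/demo/app:latest
---
apiVersion: v1
kind: Pod
metadata:
  name: resource-heavy-pod
  namespace: k8sgpt-demo
spec:
  containers:
    - name: app
      image: nginx:stable
      resources:
        requests:
          cpu: "100"      # far more than a small node offers
          memory: 500Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: broken-storage
  namespace: k8sgpt-demo
spec:
  serviceName: broken-storage
  replicas: 1
  selector:
    matchLabels:
      app: broken-storage
  template:
    metadata:
      labels:
        app: broken-storage
    spec:
      containers:
        - name: app
          image: nginx:stable
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: does-not-exist   # no such StorageClass
        resources:
          requests:
            storage: 1Gi
EOF

echo "wrote broken-workloads.yaml"
```

Apply the file with kubectl apply -f broken-workloads.yaml and give the cluster a minute or two so the failures show up in pod status and events before running the analysis.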

5. Run K8sGPT Analysis

Once the workloads are deployed, run a basic scan first, then an AI-explained scan of the pods in the demo namespace:

k8sgpt analyze

k8sgpt analyze --filter Pod --namespace k8sgpt-demo --explain | head -20

K8sGPT quickly identifies issues and generates fixes using AI.

Example Output (Summarized)

  • Image Pull Error
    Issue: Back-off pulling image
    Fix: Correct the registry/image tag
  • Insufficient Resources
    Issue: Pods cannot schedule
    Fix: Adjust CPU/memory or scale nodegroup
  • Invalid StorageClass
    Issue: PVCs stuck in Pending
    Fix: Update storage class or create a valid one

These explanations are far more readable than raw Kubernetes error messages.

Essential Commands

1. Basic Analysis (No AI)

k8sgpt analyze

2. AI-Powered Analysis (Uses Bedrock)

k8sgpt analyze --explain

3. Specific Namespace

k8sgpt analyze --explain --namespace k8sgpt-demo

4. Specific Resource Type

k8sgpt analyze --explain --filter Pod

k8sgpt analyze --explain --filter Deployment

5. Multiple Filters

k8sgpt analyze --explain --filter Pod,Deployment,Service

6. Check Configuration

k8sgpt auth list

k8sgpt filters list

7. Verbose Output

k8sgpt analyze --explain --verbose

Benefits Observed

1. Intelligent Issue Detection

K8sGPT identified issues across pods, StatefulSets, services, and jobs.

2. Human-Friendly Explanations

It rewrites cryptic Kubernetes errors into simple sentences.

3. Actionable Remediation Steps

Includes:

  • kubectl commands
  • Configuration fixes
  • Architecture recommendations

4. Speed

It analyzed more than a dozen issues in seconds.

Amazon Bedrock Model Options

Model          | Speed   | Quality   | Cost     | Best Use Case
Nova Lite      | Fastest | Great     | Cheapest | Day-to-day troubleshooting
Nova Pro       | High    | Excellent | Moderate | Complex multi-resource analysis
Claude 3 Haiku | Fast    | High      | Good     | Detailed explanations

Recommendation: start with Nova Lite.

Fixing Issues (Summary)

Fix Image Pull Failure

Delete the broken pod and recreate it with a valid image.

Fix Misconfigured Deployment

Patch the Deployment or update the Helm chart values.

Fix Storage Issues

Correct the StorageClass or create a valid one.
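
The three fixes can be sketched as small kubectl helpers. The function names are made up for this example, and every resource name or value passed to them is illustrative. For the storage case, keep in mind that a StatefulSet's volumeClaimTemplates cannot be edited in place, so the usual path is to delete the StatefulSet and its Pending PVC, correct the manifest, and apply it again.

```shell
# fix_image_pod NAMESPACE POD IMAGE
# Replace a pod stuck on an image pull error with one that uses a valid image.
fix_image_pod() {
  ns="$1"; pod="$2"; image="$3"
  kubectl delete pod "$pod" --namespace "$ns" --ignore-not-found &&
  kubectl run "$pod" --image="$image" --namespace "$ns"
}

# fix_deployment_requests NAMESPACE DEPLOYMENT REQUESTS
# Lower a Deployment's resource requests so its pods can be scheduled.
# REQUESTS uses kubectl's format, e.g. cpu=100m,memory=128Mi
fix_deployment_requests() {
  ns="$1"; deploy="$2"; requests="$3"
  kubectl set resources deployment "$deploy" \
    --requests="$requests" --namespace "$ns"
}

# list_storage_classes
# Show which StorageClasses exist before correcting a manifest.
list_storage_classes() {
  kubectl get storageclass
}
```

For example, fix_image_pod k8sgpt-demo broken-image-pod nginx:stable recreates a pod under a hypothetical name with a pullable image.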

Production Considerations

Security
  • Use IAM Roles
  • Avoid embedding credentials
  • Enable CloudTrail auditing
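
As an illustration of the IAM role approach, the role that K8sGPT runs under could carry a least-privilege policy like the sketch below. The file name and statement ID are assumptions, and the resource ARN assumes the Nova Lite model in us-east-1 used earlier. Adjust it if you invoke a different model or go through an inference profile.

```shell
# Write an IAM policy that only allows invoking the Nova Lite model.
cat > k8sgpt-bedrock-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "InvokeNovaLite",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-lite-v1:0"
    }
  ]
}
EOF

echo "wrote k8sgpt-bedrock-policy.json"
```

The policy can then be created with aws iam create-policy --policy-name K8sGPTBedrockInvoke --policy-document file://k8sgpt-bedrock-policy.json (the policy name is a placeholder) and attached to the role instead of distributing long-lived credentials.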
Cost
  • Monitor Bedrock usage
  • Use model tiering based on workload
Integrations
  • Prometheus + Grafana
  • Alerting systems
  • CI/CD pre-deployment checks
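
A CI/CD check could be as simple as failing the pipeline when K8sGPT reports any problem in a namespace. The sketch below is hypothetical: ci_gate is a made-up name, and it assumes that k8sgpt analyze --output json emits a top-level numeric "problems" field, which you should verify against your K8sGPT version. It skips --explain, so it makes no Bedrock calls.

```shell
# ci_gate NAMESPACE
# Returns 0 when no problems are reported, 1 when problems are found,
# and 2 when the report cannot be produced or parsed.
ci_gate() {
  namespace="$1"
  report="$(k8sgpt analyze --namespace "$namespace" --output json)" || return 2

  # Assumption: the JSON report has a top-level numeric "problems" field.
  count="$(printf '%s' "$report" \
    | grep -o '"problems":[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$')" || count=""

  if [ -z "$count" ]; then
    echo "could not find a problems count in the report" >&2
    return 2
  fi
  if [ "$count" -gt 0 ]; then
    echo "K8sGPT found $count problem(s) in $namespace" >&2
    return 1
  fi
  echo "no problems found in $namespace"
}
```

In a pipeline step, ci_gate k8sgpt-demo || exit 1 would stop the deployment when the scan finds something.
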
In-Cluster Operator Deployment

You can deploy the K8sGPT operator using Helm and set Bedrock as its backend for ongoing cluster analysis.
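
A rough outline of that setup is sketched below. The release name, namespace, and resource name are illustrative, and the K8sGPT custom resource fields reflect my understanding of the operator's CRD, so check them against the operator documentation for your version. It also assumes the operator's pods can reach Bedrock through an IAM role. Both function names are made up for this example.

```shell
# install_operator
# Installs the K8sGPT operator chart into its own namespace.
install_operator() {
  helm repo add k8sgpt https://charts.k8sgpt.ai/ &&
  helm repo update &&
  helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
    --namespace k8sgpt-operator-system --create-namespace
}

# write_k8sgpt_cr VERSION
# Writes a K8sGPT custom resource that points the operator at Bedrock.
# VERSION is the K8sGPT release tag you want the operator to run.
write_k8sgpt_cr() {
  version="$1"
  cat > k8sgpt-bedrock.yaml <<EOF
apiVersion: core.k8sgpt.ai/v1alpha1
kind: K8sGPT
metadata:
  name: k8sgpt-bedrock
  namespace: k8sgpt-operator-system
spec:
  ai:
    enabled: true
    backend: amazonbedrock
    region: us-east-1
    model: amazon.nova-lite-v1:0
  noCache: false
  version: ${version}
EOF
}
```

After install_operator succeeds, call write_k8sgpt_cr with the release tag you want and apply the result with kubectl apply -f k8sgpt-bedrock.yaml.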

Before vs After K8sGPT

Before
  • Manual debugging
  • Slow MTTR
  • High cognitive load
  • Heavy dependency on senior engineers
After
  • Fast AI-driven diagnostics
  • Consistent solutions
  • Junior engineers troubleshoot confidently
  • Significantly reduced MTTR

Conclusion

K8sGPT combined with Amazon Bedrock is a powerful way to modernize Kubernetes troubleshooting. It minimizes time spent on debugging, improves team efficiency, and brings clarity to complex cluster issues.

If you're managing EKS environments at scale, this integration provides:

  • Faster resolutions
  • Clearer insights
  • Better operational consistency

The future of Kubernetes troubleshooting is AI-driven, and tools like K8sGPT make that future easy to adopt.

Quick Start Checklist

  1. Configure AWS credentials
  2. Install K8sGPT
  3. Add Amazon Bedrock as AI backend
  4. Deploy workloads
  5. Run k8sgpt analyze
  6. Apply the recommended fixes

Happy Learning
Prithiviraj Rengarajan
DevOps Engineer
