DEV Community

Abhi

I Got Tired of Googling kubectl Commands at 2 AM. So I Built a Local AI Agent That Does DevOps Safely. { pip install orbit-cli }

The Problem Nobody Talks About

I deployed a couple of applications recently. Simple stuff — containerize, push to a registry, get it running on Kubernetes. Should take an afternoon, right?

It took me three days.

I'm still learning Docker and Kubernetes. Every step was a Google search. "How to write a multi-stage Dockerfile." "Why is my pod in CrashLoopBackOff." "What's the difference between kubectl apply and kubectl create." I'd find a Stack Overflow answer, copy the command, run it, get a different error, go back to Google.

And every time I needed to demo a quick POC to my team, the same cycle repeated. It was painful.

So I had an idea — what if I built a tool that already knows all of this? Something that can take a goal like "deploy this app to Kubernetes" and actually figure out the steps, run them safely, and fix things when they break?

Three weeks later, after spending every night and weekend on it alongside my day job, that tool exists. It's called Orbit.


The Conversation That Started It

My friend Sidd works in DevOps. We were talking one evening about what makes DevOps tooling painful, and he said something that stuck with me:

"The problem isn't that the commands are hard. It's that you have to hold 15 things in your head at once — what namespace you're in, what branch you're on, whether you're pointing at prod or staging, what the last error was."

That clicked. The real problem isn't knowledge — it's context. A tool that could see your entire environment (git state, running containers, K8s cluster, system info) and factor all of that into its decisions would be genuinely useful.

Sidd kept pushing me on what DevOps folks actually need. Not another chatbot wrapper. Something that understands risk. Something that won't let you accidentally delete production. Something that runs locally so your infrastructure details stay on your machine.

That became the design spec for Orbit.


Why Everything Runs Locally (and Why That Matters)

The first design decision was: nothing leaves your machine.

I use Ollama as the LLM backend. Every model runs locally. Your kubectl configs, your Docker setup, your git history, your environment variables — none of it gets sent to OpenAI or Anthropic or anyone else.

This isn't just a privacy thing (though it is). It's a practical thing. If you're working with production infrastructure, you don't want your cluster details, namespaces, pod names, and error logs flowing through a third-party API. Period.

Ollama has gotten surprisingly good. Models like Qwen 2.5 at 7B parameters can generate structured JSON reliably, follow system prompts, and reason about shell commands. Not GPT-4 level, but good enough for DevOps task planning — and it runs on my MacBook in seconds.


The Engineering: How Orbit Actually Works

The Agent Loop

When you run orbit do "find why my pods are crashing and fix it", here's what actually happens:

Goal → Scan Environment → Decompose into Subtasks → Select Models
→ Allocate Context Budget → Generate Plan → [For each step:]
  Classify Risk → Confirm with User → Execute → Observe Result
  → Success? Next step. Failed? Replan.
→ Summary

This isn't a simple "send prompt to LLM, run the output" pipeline. Each stage is its own component with its own logic.

Environment Scanning runs 5 collectors in parallel using asyncio: git state, Docker containers, Kubernetes cluster, system info, and filesystem structure. Each collector is fault-tolerant — if you don't have kubectl installed, the K8s collector returns empty instead of crashing. Everything has a 5-second timeout. Results are cached with a 5-second TTL so the agent loop doesn't re-scan on every step.
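The parallel, fault-tolerant scan can be sketched like this. This is a minimal illustration, not Orbit's actual code: the 5-second timeout comes from the description above, but the collector names, signatures, and return shapes are assumptions.

```python
import asyncio

async def run_collector(name, coro, timeout=5.0):
    """Run one collector; any error or timeout yields an empty result."""
    try:
        return name, await asyncio.wait_for(coro, timeout=timeout)
    except Exception:
        return name, {}  # e.g. kubectl not installed: empty, not a crash

async def collect_git():
    return {"branch": "main"}                  # placeholder collector

async def collect_k8s():
    raise RuntimeError("kubectl not found")    # simulates a missing tool

async def scan_environment():
    # All collectors run concurrently; one failing doesn't sink the rest.
    results = await asyncio.gather(
        run_collector("git", collect_git()),
        run_collector("k8s", collect_k8s()),
    )
    return dict(results)

snapshot = asyncio.run(scan_environment())
```

The key property is that a broken or missing tool degrades the snapshot instead of crashing the agent loop.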

Task Decomposition takes your goal and breaks it into subtasks, each tagged with a capability requirement: fast_shell for simple commands, code_gen for generating scripts, reasoning for complex analysis. This matters because...

Multi-Model Routing picks the best locally-available model for each capability. Not every model is good at everything. A small fast model can handle ls and grep, but you want your beefiest model for debugging a cascade failure. The router scans your ollama list, maps each model to capabilities based on known benchmarks, and assigns models to subtasks. No LLM call needed — it's a deterministic lookup.
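Because routing is a deterministic lookup, it boils down to a ranked table. A sketch, with made-up model names and rankings standing in for the real capability map:

```python
# Hypothetical capability ranking: best model first for each capability.
CAPABILITY_RANKING = {
    "fast_shell": ["llama3.2:3b", "qwen2.5:7b"],
    "code_gen":   ["qwen2.5-coder:7b", "qwen2.5:7b"],
    "reasoning":  ["qwen2.5:14b", "qwen2.5:7b"],
}

def route(capability, installed_models):
    """Pick the highest-ranked locally installed model for a capability.
    Pure lookup: no LLM call, same answer every time."""
    for model in CAPABILITY_RANKING.get(capability, []):
        if model in installed_models:
            return model
    # Fallback: any installed model is better than none.
    return installed_models[0] if installed_models else None
```

With only qwen2.5:7b installed, every capability routes to it; install a bigger model and reasoning tasks move over automatically.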

Context Budget Allocation is token-aware. Each context slot (git info, Docker status, etc.) has a relevance score and estimated token count. The allocator greedily fills the context window by relevance, truncating the last slot if needed. Three truncation strategies: head (for logs), tail (for diffs), and summary (first half + "[truncated]" + last half).
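The greedy fill and the three truncation strategies can be sketched in a few lines. This version measures budgets in characters for simplicity (Orbit works in tokens), and the slot shape is an assumption:

```python
def truncate(text, budget, strategy):
    """Three truncation strategies: head (logs), tail (diffs), summary."""
    if len(text) <= budget:
        return text
    if strategy == "head":
        return text[:budget]           # keep the start
    if strategy == "tail":
        return text[-budget:]          # keep the end
    marker = "[truncated]"
    half = max((budget - len(marker)) // 2, 0)
    return text[:half] + marker + text[-half:]   # first half + last half

def allocate(slots, window):
    """Greedily fill the window by relevance; a slot that only
    partially fits gets truncated with its own strategy."""
    filled, remaining = [], window
    for slot in sorted(slots, key=lambda s: -s["relevance"]):
        if remaining <= 0:
            break
        text = truncate(slot["text"], remaining, slot.get("strategy", "head"))
        filled.append((slot["name"], text))
        remaining -= len(text)
    return filled
```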


The Safety System (The Part I'm Most Proud Of)

Here's the thing about AI agents that run shell commands: they can destroy things. rm -rf /. kubectl delete namespace production. git push --force.

Most AI tools handle this by asking the LLM "is this command safe?" That's insane. You're asking the same system that generated the command to evaluate whether it's dangerous. That's like asking the intern who wrote the script whether it's safe to run in production.

Orbit's safety classifier is regex-only. Zero LLM calls. 173 hand-written regex patterns that classify every command into four tiers:

  • Safe (runs silently): ls, cat, kubectl get, docker ps, git log
  • Caution (single confirmation prompt): git push, docker build, kubectl apply, pip install
  • Destructive (impact analysis + double confirmation): rm, kubectl delete, git reset --hard, git push --force
  • Nuclear (type "i am sure" + 3-second cooldown): rm -rf /, DROP TABLE, terraform destroy, any destructive command in production

The critical design rule: unrecognized commands default to caution, never safe. If the classifier doesn't recognize your command, it assumes risk.

But the really clever part is production detection. Orbit checks your git branch, K8s namespace, and K8s context for production indicators (main, master, release/*, prod, production, live). If it detects production, any escalatable command gets bumped to nuclear automatically.

So git push origin main while you're on a feature branch? Caution (single confirm). git push origin main when you're ON main? Nuclear. Type "i am sure" and wait 3 seconds. Because that push is going to production.
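Put together, the classifier plus escalation looks roughly like this. A miniature, illustrative version: a handful of patterns stand in for the real 173, and the function shape is assumed.

```python
import re

# Ordered pattern list; first match wins, so nuclear patterns come first.
PATTERNS = [
    (re.compile(r"^rm\s+-rf\s+/\s*$"), "nuclear"),
    (re.compile(r"^kubectl\s+delete\s+namespace\b"), "nuclear"),
    (re.compile(r"^kubectl\s+delete\b"), "destructive"),
    (re.compile(r"^git\s+push\b"), "caution"),
    (re.compile(r"^(ls|cat|git\s+log|kubectl\s+get|docker\s+ps)\b"), "safe"),
]

PROD_INDICATORS = {"main", "master", "prod", "production", "live"}

def classify(command, branch="feature/x"):
    tier = "caution"                 # unrecognized defaults to caution, never safe
    for pattern, t in PATTERNS:
        if pattern.search(command):
            tier = t
            break
    # Production escalation: escalatable tiers get bumped to nuclear.
    if branch in PROD_INDICATORS and tier in ("caution", "destructive"):
        return "nuclear"
    return tier
```

No LLM anywhere in that path: the same command on the same branch always classifies the same way.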

This saved me during development. I was testing with my actual git repo and almost pushed garbage to main. The escalation caught it.


Auto-Replanning: When Things Go Wrong

Real DevOps isn't linear. Commands fail. Pods crash. Builds break. An agent that can only execute a static plan is useless.

Orbit's observer watches every command result. If a step fails and there's replan budget remaining, it feeds the error back to the LLM with context about what already succeeded, and gets a new plan for the remaining steps. No re-running successful steps. The replanner addresses the specific error.

But replanning has hard limits. The agent budget enforces: max 15 steps, max 3 replans per step, max 25 total LLM calls. When the budget is exhausted, Orbit exits gracefully with a summary of what it accomplished and what failed. No infinite loops. No runaway token consumption.
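The budget itself is simple bookkeeping. A sketch, assuming a shape like this (the limits are from the description above; the class is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class AgentBudget:
    max_steps: int = 15
    max_replans_per_step: int = 3
    max_llm_calls: int = 25
    steps: int = 0
    llm_calls: int = 0
    replans: dict = field(default_factory=dict)  # step index -> replan count

    def can_step(self) -> bool:
        return self.steps < self.max_steps and self.llm_calls < self.max_llm_calls

    def can_replan(self, step: int) -> bool:
        # Per-step replan cap AND a global LLM-call cap must both hold.
        return (self.replans.get(step, 0) < self.max_replans_per_step
                and self.llm_calls < self.max_llm_calls)

    def record_replan(self, step: int) -> None:
        self.replans[step] = self.replans.get(step, 0) + 1
        self.llm_calls += 1
```

When any check fails, the loop exits and writes the summary instead of trying again.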

Rollback Plans

Every destructive step gets a rollback plan. git reset --hard? Rollback via git reflog. kubectl apply -f deploy.yaml? Rollback is kubectl delete -f deploy.yaml. docker compose down? Rollback is docker compose up -d.

Some things can't be rolled back (rm, kubectl delete pod). Orbit tells you that explicitly: "File deletion is irreversible. Check backups."


The Numbers

300 tests. All passing. 3.09 seconds.

I didn't write 300 tests to pad a number. Each test validates a specific behavior:

  • 129 safety classifier tests — every regex disambiguation edge case. sed vs sed -i. rm -rf ./dir (destructive) vs rm -rf /tmp (nuclear). git stash list (safe) vs git stash pop (caution). Production escalation for every git branch variant.
  • 53 agent tests — real subprocess execution with timeout/kill, streaming output, observer decisions, planner model selection fallback chains, budget enforcement.
  • 43 context tests — parallel scanner with fault tolerance, cache TTL, context budget truncation strategies (head/tail/summary), allocation edge cases.
  • 18 router tests — model capability matching, priority lookup, decomposer with LLM fallback.
  • 57 more across schemas, config, CLI, memory, and LLM provider interfaces.

Every Pydantic model validates. Every JSON schema generates correctly. Every error path returns a safe fallback instead of crashing.


What I Learned Building This

Claude Code Made This Possible

I need to be honest about this: I couldn't have built Orbit in three weeks of nights and weekends without AI coding tools. Claude Code was a massive multiplier.

The pattern was: I'd think through what I wanted (the safety classifier design, the context budget allocator, the observer decision logic), then use Claude Code to help me write it, iterate on edge cases, and generate test coverage. The architecture and design decisions were mine (with Sidd's input). The implementation velocity came from having a coding partner that doesn't sleep.

This is the reality of building software in 2025. The ideas, the architecture, the "what should this do and why" — that's still deeply human. The "write me a regex that matches kubectl delete with a negative lookahead for namespace" — that's where AI shines.

The Regex Safety Classifier Was The Hardest Part

Not the LLM integration. Not the async context scanning. The 173 regex patterns.

Every pattern has to be precise. sed without -i is safe (just prints to stdout). sed -i modifies files in place — that's caution. The safe pattern uses (?!.*-i) negative lookahead to exclude the -i variant. Get that wrong and you're either blocking harmless commands or letting dangerous ones through silently.

kubectl delete pod is destructive. kubectl delete namespace is nuclear. kubectl delete pods --all is nuclear. Three different patterns, ordered so nuclear matches first. First match wins.
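Both tricks are easy to show as runnable checks. The patterns below mirror the ideas above (the lookahead and the nuclear-first ordering) but are illustrative, not Orbit's exact patterns:

```python
import re

# sed without -i just prints to stdout; the negative lookahead
# rejects any variant that has a standalone -i flag.
SED_SAFE = re.compile(r"^sed\b(?!.*\s-i\b)")

# Nuclear patterns first; first match wins.
KUBECTL_PATTERNS = [
    (re.compile(r"^kubectl\s+delete\s+namespace\b"), "nuclear"),
    (re.compile(r"^kubectl\s+delete\s+pods\s+--all\b"), "nuclear"),
    (re.compile(r"^kubectl\s+delete\b"), "destructive"),
]

def kubectl_tier(cmd):
    for pattern, tier in KUBECTL_PATTERNS:
        if pattern.search(cmd):
            return tier
    return "caution"
```

Reverse the kubectl list and `kubectl delete namespace prod` would match the generic pattern first and come back merely destructive, which is exactly the silent failure mode the ordering prevents.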

I spent an entire weekend just on the safety patterns. Writing them, testing them, finding edge cases, fixing them, finding more edge cases. It's the kind of work that's tedious but existentially important when your tool runs shell commands.

Structured Output Is The Right Way To Talk To LLMs

Every single LLM call in Orbit returns structured output. Pydantic model → JSON schema → Ollama's format= parameter. The model returns JSON that matches the schema, Pydantic validates it, and the code gets typed data.

No parsing free text. No "extract the command from between the backticks." No regex on LLM output. If the JSON doesn't validate, retry once, then return a safe fallback (empty plan, single general subtask, etc.).

With temperature=0, Qwen 2.5 generates valid JSON matching the schema 95%+ of the time. The retry catches most of the rest. The fallback catches the remainder. Three layers of defense.
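The three layers fit in a dozen lines. A sketch: `call_llm` stands in for the real Ollama call (which would pass `Plan.model_json_schema()` as the `format=` argument); the model names and schema here are illustrative.

```python
from pydantic import BaseModel, ValidationError

class Step(BaseModel):
    command: str
    rationale: str

class Plan(BaseModel):
    steps: list[Step]

def get_plan(call_llm, prompt: str) -> Plan:
    """Layer 1: schema-constrained generation. Layer 2: one retry.
    Layer 3: safe fallback (an empty plan) instead of crashing."""
    for _ in range(2):                     # first try + one retry
        raw = call_llm(prompt)             # returns a JSON string
        try:
            return Plan.model_validate_json(raw)
        except ValidationError:
            continue
    return Plan(steps=[])                  # safe fallback
```

Downstream code only ever sees a validated `Plan`, so there's no free-text parsing anywhere after this point.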


Try It

pip install orbit-cli

You'll need Python 3.11+ and Ollama running with at least one model:

ollama pull qwen2.5:7b

Then:

orbit do "check disk usage and find the largest directories"
orbit sense                    # see what Orbit sees in your environment
orbit wtf                      # debug the last failed command
orbit ask "why is my pod in CrashLoopBackOff?"

The code is on GitHub: github.com/abhimanyubhagwati/orbit-cli

This is my second open source release after TraceForge (a testing tool for AI agents). Building tools at night alongside a day job isn't easy, but it's the most fun I have writing code.

If you try it — drop a comment below and tell me what command you ran first. And if it does something unexpected, open an issue. That feedback is exactly what makes this better.


Orbit is Apache 2.0 licensed. Built by Abhimanyu Bhagwati.
