Most AI dev tools assume you have a repo. Ops engineers have a broken node and a 3am page.

#linux #sre #shell #cli

Most AI coding tools assume you're sitting in front of a repo.
There's a working directory, some source files, tests, maybe a CI pipeline. The AI reads your code, suggests changes, runs tests. Great model, if you're a developer writing code on a Tuesday afternoon.
Now picture the other scenario.
It's 2am. PagerDuty fires. You SSH into a box you haven't touched in three months. Something is broken, you're not sure what, and the runbook was last updated by someone who left the company in 2022.
You're not thinking about repos. You're thinking: what OS is this thing running? What just failed? Is it safe to restart that service or will I make it worse?
These are two fundamentally different workflows. But almost every AI terminal tool I've seen is built for the first one.

The 30 minutes nobody builds for
There's a ton of tooling for the world before you log into the box: Prometheus, Grafana, PagerDuty, incident.io, runbooks, dashboards. All useful. No complaints.
But there's this practical 30-minute window after you SSH in where you're basically doing archaeology with journalctl and grep. You check systemd. You look at disk. You read logs that were clearly written by someone who hated future-you. You copy-paste terminal output into Slack so your teammate can squint at it from a different timezone.
This is where I think AI could actually help. Not by replacing Grafana. Not by building another dashboard. Just by being present in the shell while you're debugging.

Please, for the love of uptime, don't replace my shell
I've seen a few AI terminal projects that basically build an entire new terminal experience from scratch. New keybindings, new UI, new everything.
Here's the thing: ops people have muscle memory. We have aliases we wrote in 2019 and forgot about. We have tmux configs we'd defend with our lives. We SSH through jump hosts with key forwarding chains that barely work but have worked for years so nobody touches them.
If your AI tool requires me to abandon all of that, I'm not going to trust it. And trust is everything when you're root on a production box at 2am.
The approach I believe in: add AI on top of the existing shell. Explain the failed command. Summarize the relevant logs. Suggest what to check next. But let me keep my shell, my workflow, my muscle memory.
AI should be an assistant layer, not a hostile takeover of my terminal.

Node context ≠ project context
This is the part that most coding-AI-turned-ops-AI projects get wrong.
Coding assistants are project-aware: they know the repo, the files, the dependencies. That makes sense for development.
Ops needs something different: node-aware context.
When I'm debugging a service failure, I need the AI to understand that I'm on an Ubuntu 22.04 box running kernel 5.15, with systemd managing nginx, and that this machine had a disk pressure incident two weeks ago. I need it to know that chmod 777 /var/lib/my-service is not just a string — it's a risk that depends on what this node does in production.
That's a very different context model from "here's your package.json."
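To make "node-aware context" a bit more concrete, here's a minimal sketch of what a read-only node snapshot could look like. The field names and the `collect_node_context` helper are my own illustration, not a fixed schema. The only real-world assumption is the `KEY=value` format of `/etc/os-release`.

```python
import platform
import shlex
from typing import Optional


def parse_os_release(text: str) -> dict:
    """Parse the KEY=value format used by /etc/os-release."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        try:
            # Values may be shell-quoted; shlex removes the quotes.
            parts = shlex.split(raw)
            fields[key] = parts[0] if parts else ""
        except ValueError:
            # Unbalanced quotes: fall back to trimming them by hand.
            fields[key] = raw.strip("\"'")
    return fields


def collect_node_context(os_release_text: str, kernel: Optional[str] = None) -> dict:
    """Build a small snapshot of the node (hypothetical field names)."""
    info = parse_os_release(os_release_text)
    return {
        "os": info.get("PRETTY_NAME", "unknown"),
        "os_id": info.get("ID", "unknown"),
        "os_version": info.get("VERSION_ID", "unknown"),
        # Fall back to the kernel release of the machine running this code.
        "kernel": kernel or platform.release(),
    }


# Sample input so the sketch runs anywhere; on a real node you would
# read the text from /etc/os-release instead.
sample = 'PRETTY_NAME="Ubuntu 22.04.3 LTS"\nID=ubuntu\nVERSION_ID="22.04"\n'
print(collect_node_context(sample, kernel="5.15.0-91-generic"))
```

A real version would probably also want things like which systemd units are failing and recent incident notes for the node, but even this much is more than a coding assistant knows about the box you're on.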

Safety is a timing problem
Here's a thing that keeps me up at night (besides PagerDuty):
An AI that's helpful after you run rm -rf on the wrong directory is not helpful. It's a very polite post-mortem assistant.
The guardrails need to happen before execution. Useful categories:

this command is read-only, go ahead
this command changes state, here's what it does
this command deletes things, are you sure?
this command touches production traffic, maybe verify first?

And for most debugging sessions, the first mode should just be read-only diagnosis. Collect facts. Inspect logs. Explain what's probably wrong. Suggest verification commands. Only start talking about changes after the human explicitly says "ok, let's fix it."
A correct command at the wrong time is still an incident. Ask anyone who's run a database migration during peak traffic.
Don't ask me. Definitely don't ask me.
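The pre-execution categories above can be sketched as a tiny classifier that looks at a command before it runs. This is an illustration under loud assumptions: the lookup tables are hand-picked and incomplete, the set of "traffic-serving" units is hypothetical and would really come from node context, and anything the sketch doesn't recognize is treated as "review by hand" rather than guessed at.

```python
import os
import shlex
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read-only, go ahead"
    CHANGES_STATE = "changes state"
    DESTRUCTIVE = "deletes things"
    TOUCHES_TRAFFIC = "touches production traffic"
    UNKNOWN = "not recognized, review by hand"


# Illustrative and deliberately incomplete lookup tables (assumptions).
# Some tools listed as read-only have mutating flags (for example
# journalctl --vacuum-size) that a real classifier must special-case.
READ_ONLY_CMDS = {"ls", "cat", "grep", "df", "du", "ps", "journalctl", "uptime", "free"}
DESTRUCTIVE_CMDS = {"rm", "dd", "shred", "truncate"}
STATE_CMDS = {"chmod", "chown", "mv", "cp", "apt", "apt-get", "dnf", "yum"}
SYSTEMCTL_READ_VERBS = {"status", "show", "cat", "is-active", "is-enabled", "list-units"}
SYSTEMCTL_DISRUPTIVE_VERBS = {"restart", "stop", "reload", "kill"}
TRAFFIC_UNITS = {"nginx", "haproxy"}  # hypothetical: depends on the node
SHELL_METACHARS = set("|;&<>`$")


def classify(command: str) -> Risk:
    """Classify a command line before execution; never runs anything."""
    # Pipes, redirects and substitutions can hide side effects: refuse to guess.
    if any(ch in SHELL_METACHARS for ch in command):
        return Risk.UNKNOWN
    try:
        argv = shlex.split(command)
    except ValueError:  # for example, unbalanced quotes
        return Risk.UNKNOWN
    if argv and argv[0] == "sudo":
        argv = argv[1:]
    if not argv:
        return Risk.UNKNOWN
    prog, args = os.path.basename(argv[0]), argv[1:]
    if prog == "systemctl":
        words = [a for a in args if not a.startswith("-")]
        if not words or words[0] in SYSTEMCTL_READ_VERBS:
            return Risk.READ_ONLY
        suffix = ".service"
        units = {w[: -len(suffix)] if w.endswith(suffix) else w for w in words[1:]}
        if words[0] in SYSTEMCTL_DISRUPTIVE_VERBS and units & TRAFFIC_UNITS:
            return Risk.TOUCHES_TRAFFIC
        return Risk.CHANGES_STATE
    if prog in DESTRUCTIVE_CMDS:
        return Risk.DESTRUCTIVE
    if prog in STATE_CMDS:
        return Risk.CHANGES_STATE
    if prog in READ_ONLY_CMDS:
        return Risk.READ_ONLY
    return Risk.UNKNOWN


def needs_confirmation(risk: Risk) -> bool:
    """Only known read-only commands pass without asking the human."""
    return risk is not Risk.READ_ONLY


for cmd in ("journalctl -u nginx", "systemctl restart nginx", "rm -rf /var/lib/my-service"):
    print(f"{cmd!r}: {classify(cmd).value}")
```

The design choice that matters is the default: unknown means "ask", not "allow". A static table like this can't know what time it is or what the node is serving, so it only covers the what, not the when.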

What this is not
I want to be clear about scope, because "AI for ops" can mean anything from "a smarter grep" to "we're going to automatically self-heal your entire infrastructure."
This is not an AIOps platform. Not a replacement for your monitoring stack. Not a magic auto-remediation engine. Not a new terminal emulator.
It's a narrower question: after you SSH into a Linux box, can AI help you figure out what's wrong faster, with less copy-pasting and fewer chances of making things worse?
That's it. That's the whole pitch.

What I'm working on
I've been building something along these lines called aish — an AI shell runtime that sits inside your real terminal and focuses on Linux ops troubleshooting.
The core loop: you run commands normally. When something fails, you ask the AI. It uses the last command, exit code, output, and node context to help diagnose. It warns before risky operations. It stays out of the way when you don't need it.
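As a rough illustration of that loop (not aish's actual implementation, which lives inside an interactive shell), here's a sketch that runs one command, keeps the exit code and the tail of the output, and bundles them with node context. The `CommandRecord` type, `run_and_record`, and `build_diagnosis_request` are hypothetical names I made up for this example, and it uses a plain subprocess rather than a PTY.

```python
import subprocess
import sys
from dataclasses import dataclass, asdict


@dataclass
class CommandRecord:
    command: list
    exit_code: int
    output_tail: str


def run_and_record(argv, tail_lines=20):
    """Run a command and keep its exit code plus the last lines of output."""
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr, like a terminal would
        text=True,
    )
    tail = "\n".join(proc.stdout.splitlines()[-tail_lines:])
    return CommandRecord(list(argv), proc.returncode, tail)


def build_diagnosis_request(record, node_context):
    """Bundle the facts an assistant would need to reason about a failure."""
    return {
        "failed": record.exit_code != 0,
        "last_command": asdict(record),
        "node": node_context,
    }


# Demo with a command that fails on purpose, so it runs on any machine.
record = run_and_record([sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"])
print(build_diagnosis_request(record, {"os": "Ubuntu 22.04", "kernel": "5.15"}))
```

Keeping only the tail of the output is a deliberate trade-off in this sketch: the last lines usually hold the error, and it keeps the payload small.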
It's not glamorous work. Most of the problems it deals with are things like systemd unit failures, broken package repos, permission issues, disk pressure, and the eternal "why won't nginx start." The kind of stuff that isn't exciting until it's your problem at 2am.
It's currently at v0.3.4 (I recently rewrote the core from Python to Rust, which is a story for another post involving haunted PTYs).
If you spend time debugging Linux boxes and have opinions about what would actually help in that workflow, I'd genuinely like to hear them — either here or on GitHub.
