Escaping DevOps hell with Codex

If you are a developer, you are probably well aware of all the AI goodness that has been happening. I won't bore you with the hyperbole.

My weapon of choice is Codex. Other AI coding tools are available, but debating which one is best at what is not a very interesting conversation to me. What I want to focus on here is DevOps.

I'm the CTO of a small company, which means I switch hats a lot. I used to call myself a backend developer, but these days I do everything vaguely tech-related in our company, plus marketing, sales, and a bunch of other supporting roles. There are only a few of us in the company. If it needs doing, one of us needs to do it. ChatGPT and Codex allow us to get shit done.

DevOps sucks. It really does.

One of the hats I end up wearing is DevOps. I hate doing DevOps. I've been exposed to this stuff for well over two decades. I know how to do it properly. I've done everything from uploading zip files over ISDN lines and remotely restarting Tomcat, as you did in the early 2000s, to automating deployments with Puppet, using CloudFormation in AWS, and faffing about with Kubernetes, Docker Swarm, Terraform, and much more. Lately, my weapons of choice are Ansible, Docker, and Docker Compose. I've invested countless hours in learning all that stuff and trying to apply it.

Why do I think DevOps sucks? To me, it feels like dropping out of warp, to use a Star Trek analogy. You have all these grand plans to get some big feature out, and then you find yourself micromanaging some insanely arcane shit in Linux to get it to tell the time correctly, deal with some convoluted networking thing, or whatever. You get blocked for weeks on end. All that to solve the age-old problem of "put this fucking thing over there and run it!" (pardon my French). I call this problem inception. You start with a grand vision: "Our shiny new backend is ready to go, let's deploy it and announce it to the world." Somehow, that escalates into: "I need to figure out how to set up a bastion and private networking so I don't expose my database to the public internet." One thing leads to another, and before you know it, you've sunk three months into the project.

DevOps should be simple but isn't

DevOps is supposed to be about automating what should be automated. So, why does DevOps still feel so manual? The answer is that this stuff is genuinely complicated, and over decades we have built systems full of bear traps with terrible failure modes: data loss, security breaches, downtime, and worse. There is just a lot of stuff that a DevOps person needs to know and do. Taking shortcuts can lead to disaster. That's why it often ends up being a full-time job.

Every once in a while, I get sucked into doing a stretch of DevOps that makes me feel stupid, because it should be simple. Instead, I end up pulling my hair out for days trying to solve weird shit that refuses to work without ritualistic bullshit, magical command-line incantations, and configuration files that need to be exactly right. I know some truly excellent operations and DevOps people and, honestly, I suffer a bit from imposter syndrome whenever I have to do this stuff myself. I'm skilled enough to be dangerous, and I know it.

Codex takes the pain away

I got pulled into the latest round of this two weeks ago. We'd been eyeing our setup in Google and concluded that we're spending about 10K per year on hosting. It works great, but it's a lot, and we'd prefer to give ourselves a little raise instead of donating to Google. So, I embarked on a plan to migrate to Hetzner.

This time, though, I had Codex do it. We also have a second setup that, for customer reasons, runs in Telekom Cloud, which is basically an OpenStack-based environment. I already had a lot of Ansible scripts to provision that.

I started by telling Codex to refactor and modernize that codebase and set up a new inventory for my brand-new Hetzner setup. I created a few VMs in Hetzner, a private network, and a load balancer. One of the VMs acts as a bastion so you can SSH into it to reach the other ones that don't have a public IP address.
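For readers who haven't set up a bastion before: reaching the private VMs usually comes down to OpenSSH's ProxyJump option (`-J`). Here is a minimal sketch in Python that only builds the command line. The host name, user, and address are made up for illustration and are not taken from the actual inventory.

```python
import shlex


def ssh_via_bastion(target: str, bastion: str, user: str = "root") -> list[str]:
    """Build an ssh command that hops through a bastion using ProxyJump (-J).

    All host names and the default user are hypothetical placeholders.
    """
    return ["ssh", "-J", f"{user}@{bastion}", f"{user}@{target}"]


if __name__ == "__main__":
    # 10.0.0.3 stands in for a private-network address with no public IP.
    cmd = ssh_via_bastion("10.0.0.3", "bastion.example.com")
    print(shlex.join(cmd))
```

Ansible can be pointed at the same hop through its `ansible_ssh_common_args` connection variable, for example with `-o ProxyJump=...`, though how the real inventory does it is not shown here.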

In small steps, I fixed, upgraded, and modernized the Ansible scripts, using the new Hetzner setup as the test bed. I let Codex do all the work. I got it to fix the Ansible code and drive the provisioning through the tools on my laptop and over SSH. I set up skills and guardrails around the process.

When the Ansible scripts failed, I got it to debug why and implement fixes. I got it to research workarounds. A lot of this was me guiding it along the way. I was leaning on my 2+ decades of experience, but I did not touch a single line of code.

The key part is the feedback loop. As you go, you will see it struggle and then figure things out. When that happens, you ask it to record what it just did as a skill. You read over the markdown it generates, suggest further improvements, and the next time it does better. This feedback loop between you and the AI is super important. It's the difference between doing things your way, i.e. the right way, and doing it in whatever way it dreams up. Knowing what the right way is comes from experience, and that is now our added value as software makers.

After 3 days, I switched over the DNS and shut down the last of our Google VMs. That's 3 days from creating the Hetzner account, not counting the extra day I spent getting them to raise my VM limit because I was still in my first month.

Doing rolling restarts with Elasticsearch

Part of our setup is Elasticsearch. If you've never used it, it's a clustered search engine that achieves high availability through node and cluster coordination. To upgrade it, you have to do a so-called rolling restart. You can't just yank the plug and upgrade all nodes because that would make the cluster unavailable and cause all sorts of issues with split brains, data loss, and worse. This is exactly the kind of thing you normally want an experienced DevOps person to handle responsibly. DevOps is full of tasks where experience matters, because the wrong shortcut can quietly turn into an outage or data loss.

The XXX Elasticsearch cluster upgrade is complete.

All three nodes are now on 9.3.1:

es1 9.3.1
es2 9.3.1
es3 9.3.1

Final cluster state:

green
3 nodes / 3 data nodes
0 relocating shards
0 initializing shards
0 unassigned shards

Notable preflight/result details:

Latest 9.3.x release verified: 9.3.1
Snapshot health passed before rollout; latest successful snapshot was 2026_03_12_05_33
System feature migration was required and completed before the first node restart

I did the rollout one node at a time in this order:

es3
es2
es1

No code changes were needed in the repo.

This morning, I asked Codex to reprovision the cluster in our old data center. I last did this in September, manually. It first did a quick inventory check and a dry run on one of the nodes. It confirmed that the cluster was healthy (green) and that the latest snapshot had succeeded. Then it went ahead and executed the rollout, following the plan to the letter. The block above is the report it gave me when it finished. The whole thing took about 20 minutes. I was on standby to approve the next step the few times it came back to ask, as my skill for this specifies.
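The health part of that preflight can be expressed as a small gate over the response of Elasticsearch's `GET _cluster/health` API. This is a sketch, not the actual skill: the function name and the `expected_nodes` parameter are hypothetical, while the field names are the ones the cluster health response uses. The snapshot check is left out.

```python
def preflight_problems(health: dict, expected_nodes: int) -> list[str]:
    """Return the reasons a rollout should NOT start; an empty list means go."""
    problems = []
    if health.get("status") != "green":
        problems.append(f"cluster status is {health.get('status')!r}, not 'green'")
    if health.get("number_of_nodes") != expected_nodes:
        problems.append(
            f"expected {expected_nodes} nodes, saw {health.get('number_of_nodes')}"
        )
    # Any shard movement means the cluster is still settling; a missing field
    # is treated as a problem too.
    for key in ("relocating_shards", "initializing_shards", "unassigned_shards"):
        if health.get(key) != 0:
            problems.append(f"{key} is {health.get(key)}, expected 0")
    return problems


if __name__ == "__main__":
    # Sample payload shaped like the report above: green, 3 nodes, no shard movement.
    healthy = {
        "status": "green",
        "number_of_nodes": 3,
        "relocating_shards": 0,
        "initializing_shards": 0,
        "unassigned_shards": 0,
    }
    print(preflight_problems(healthy, expected_nodes=3))
```

Returning a list of reasons instead of a bare boolean gives the agent (and the human approving the step) something concrete to report when the gate refuses.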

Before this, as part of my Hetzner migration, I iterated with Codex on a skill for this procedure. The skill covers what an experienced DevOps person would normally do: preflight checks before kickoff, confirmation gates that ask me for permission, and guidance for "what if this happens" scenarios. There's plenty of advice on the internet, and I even wrote about this exact topic years ago in "Running Elasticsearch in a Docker 1.12 Swarm". Writing blog articles was something I used to do more regularly as a way of saying, "I should remember this weird thing I just spent 2 days figuring out so I don't have to spend that time again." It's the pre-AI way of creating and recording skills.
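The shape of such a procedure (a confirmation gate per node, then a health wait before moving on) can be sketched as a plain loop. Everything here is hypothetical: `confirm`, `restart_node`, and `wait_for_green` are injected callables standing in for whatever actually performs those steps, and this is not the skill file itself.

```python
from typing import Callable, Iterable


def rolling_restart(
    nodes: Iterable[str],
    confirm: Callable[[str], bool],
    restart_node: Callable[[str], None],
    wait_for_green: Callable[[], bool],
) -> list[str]:
    """Restart nodes one at a time; stop at the first refusal or unhealthy state.

    Returns the nodes that were restarted and came back healthy.
    """
    done: list[str] = []
    for node in nodes:
        # Confirmation gate: a human approves every step.
        if not confirm(f"Restart {node}?"):
            break
        restart_node(node)
        # Never touch the next node until the cluster has recovered.
        if not wait_for_green():
            break
        done.append(node)
    return done


if __name__ == "__main__":
    restarted = rolling_restart(
        ["es3", "es2", "es1"],           # order taken from the report above
        confirm=lambda prompt: True,     # stand-in for an interactive prompt
        restart_node=lambda node: None,  # stand-in for the real restart
        wait_for_green=lambda: True,     # stand-in for polling cluster health
    )
    print(restarted)
```

Stopping at the first problem is the point: a half-finished rollout with a healthy cluster is far easier to recover from than one that kept going.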

If you are interested, you can find the skill file I used here.

What's next?

Over the past few weeks, I've been planning and executing a ridiculous amount of work. I've launched two new websites via Cloudflare. I've created a few new OSS projects. I've done some major surgery on our two live deployments. I've also shipped a few major new features. Somehow, I have also found some time to try out OpenClaw and play with some new AI stuff. I've compressed months of work into a few weeks. I'm not going to lie: I'm exhausted, but I'm also energized. This is crazy fun.

Next on my agenda for modernizing DevOps bullshit that I don't want to deal with is getting some world-class AI monitoring and alerting in place. I need telemetry, logging, and all the rest. I have some of that already, but having it and actually using it are two different things. I want an AI to handle the operational discipline part: checking uptimes, verifying backups, watching resource usage, and making sure everything is working as it should. I want it to give me daily reports, summarize what matters, and escalate issues. I don't want to take a sabbatical to set all this up manually. I just want to get this shit done.
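As a rough sketch of what a daily report with escalation might boil down to: each check yields a result, failures get escalated, and the rest is summarized. The `Check` type and the check names are invented for illustration; nothing here reflects an existing setup.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    ok: bool
    detail: str = ""


def daily_report(checks: list[Check]) -> tuple[str, list[Check]]:
    """Return a one-line summary plus the failed checks that need escalation."""
    failed = [c for c in checks if not c.ok]
    summary = f"{len(checks) - len(failed)}/{len(checks)} checks passed"
    return summary, failed


if __name__ == "__main__":
    summary, escalate = daily_report([
        Check("uptime", True),
        Check("backup", False, "hypothetical: last snapshot older than 24h"),
        Check("disk usage", True),
    ])
    print(summary)
    for c in escalate:
        print(f"ESCALATE {c.name}: {c.detail}")
```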

If this sounds like something your team needs

One of the other things I did with Codex recently was launch our AI services and consulting site: formationxyz.com.

The pitch is simple. A lot of companies can see the opportunity with AI, but they struggle to turn that into practical systems, useful workflows, and actual leverage for their teams. That is exactly the gap we want to help close.

At FORMATION XYZ, we help small teams automate repetitive work, build practical AI systems, and put agentic workflows in place that reduce manual effort and create more capacity. If the kind of work I described above sounds interesting to you, and you want help applying AI inside your company in a pragmatic way, we can help.