DEV Community: DevOps AI ToolKit

The 12 DevOps Errors That Page Teams Most (And the First Thing to Check)

James Joyner — Wed, 15 Jul 2026 14:58:06 +0000

Over the last while I've been cataloguing production DevOps errors — the exact strings that show up in logs at 2 a.m. — and writing a fix for each one. A pattern jumps out fast: a small number of errors account for a huge share of the pages. Here are the twelve that come up most, with the one-thing-to-check-first for each.

None of these are exotic. That's the point. The stuff that actually pages you is rarely exotic — it's the same dozen failure modes wearing different hats.

1. `CrashLoopBackOff`

The pod started, died, and Kubernetes is now backing off between restarts. CrashLoopBackOff is a symptom, never a cause. Go straight to kubectl logs <pod> --previous — the logs from the crashed container are where the real error lives. Nine times out of ten it's a bad config value, a missing env var, or a failed migration on startup.
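
To make "backing off" concrete, here is a small Python sketch of the restart delay schedule. It assumes the kubelet defaults described in the upstream Kubernetes docs (a 10-second first delay that doubles per crash and caps at five minutes); the function name is made up for illustration.

```python
def crashloop_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Return the back-off delay (seconds) before each of the first `restarts` restarts.

    Assumes the documented kubelet defaults: start at 10s, double each time, cap at 300s.
    """
    delays = []
    delay = base
    for _ in range(restarts):
        delays.append(min(delay, cap))
        delay = min(delay * 2, cap)
    return delays


if __name__ == "__main__":
    print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After a handful of crashes the pod is only retried every five minutes, which is why a fix can look like it "didn't work" for a while.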

2. `ImagePullBackOff` / `ErrImagePull`

Kubernetes can't pull the image. Don't guess — kubectl describe pod spells it out in Events. It's almost always a typo in the tag, a missing imagePullSecret, or a registry rate limit.

3. `OOMKilled` (exit code 137)

This is not "the node ran out of memory." It's "this container hit its own cgroup memory limit and the kernel killed it." Different problem, different fix. Compare the pod's resources.limits.memory against what it actually uses (kubectl top pod) before you touch anything at the node level.
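
The 137 in the heading follows the usual shell convention of 128 plus the signal number. A minimal Python sketch of that decoding (the helper name is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with error status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))  # killed by SIGKILL
```

The exit code alone only says SIGKILL. It is the OOMKilled reason in the container status that ties the kill to the memory limit.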

4. `No space left on device`

The classic — and the trap is when df -h shows free space anyway. Then it's one of two things: you're out of inodes (df -i), or a process is holding a deleted-but-still-open file (lsof +L1). rm won't reclaim that space until you restart the process holding the file descriptor.
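
If you would rather script the blocks-versus-inodes check, a rough Python equivalent using os.statvfs might look like this. It is a sketch for Linux hosts; the function name and the returned keys are assumptions, and the percentages will differ slightly from df because df accounts for reserved blocks.

```python
import os


def disk_usage_report(path: str = "/") -> dict:
    """Return block and inode usage percentages for the filesystem holding `path`."""
    st = os.statvfs(path)

    def pct_used(total: int, free: int):
        # Some filesystems report 0 total inodes; return None rather than divide by zero.
        if total == 0:
            return None
        return round(100 * (total - free) / total, 1)

    return {
        "blocks_used_pct": pct_used(st.f_blocks, st.f_bfree),
        "inodes_used_pct": pct_used(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    print(disk_usage_report("/"))
```

A low blocks_used_pct next to an inodes_used_pct near 100 is the "df -h shows free space anyway" case.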

5. DNS timeouts inside pods

An external lookup that works from the node but intermittently times out inside a pod is almost always the ndots:5 search-domain cascade colliding with a conntrack UDP race — you get a flat 5-second stall that blows your client timeout. Overriding ndots on the pod spec and running NodeLocal DNSCache is the fix.
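
To see why ndots:5 multiplies lookups, here is a simplified Python model of how a glibc-style resolver expands a name. The search list in the example is the typical one for a pod in a namespace called default, which is an assumption here; check /etc/resolv.conf inside your own pod.

```python
def resolver_candidates(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """List the fully qualified names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search-list expansion.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then fall back to the search list.
        return [absolute] + expanded
    # Too few dots: walk the whole search list before trying the name as-is.
    return expanded + [absolute]


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for candidate in resolver_candidates("api.example.com", search):
        print(candidate)
```

With ndots:5, api.example.com has only two dots, so three cluster-suffixed names are tried before the real one. Lowering ndots or writing the name with a trailing dot skips that cascade.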

6. `FATAL: sorry, too many clients already` (Postgres)

Bumping max_connections is the trap, not the fix — each connection costs real memory. You need a pooler (PgBouncer), not 500 backend processes.

7. `Connection refused`

Something reached the host and nothing was listening on that port. It's rarely DNS or the network. It's the service being down, bound to 127.0.0.1 instead of 0.0.0.0, or a firewall rule that actively rejects the connection (a rule that silently drops packets shows up as a timeout instead). ss -tlnp on the target tells you in one line.
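
From the client side, the difference between "refused" and "timed out" is worth capturing explicitly, because they point at different causes. A minimal Python sketch (the function name and the three labels are my own):

```python
import socket


def probe_port(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connect attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening, or something rejected us.
        return "refused"
    except socket.timeout:
        # No answer at all: dropped packets, routing trouble, or a silent firewall.
        return "timeout"


if __name__ == "__main__":
    print(probe_port("127.0.0.1", 8080))
```

A "refused" result means the packet reached a host that said no, so look at the service and its bind address. A "timeout" means nothing answered, so look at the path.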

8. `TLS handshake timeout`

Usually not a cert problem at all. It's a network path problem (MTU, a proxy, or a firewall silently dropping the handshake) masquerading as TLS. Check raw TCP reachability first with nc -vz host 443, then run openssl s_client -connect host:443 to see how far the handshake gets before it stalls.

9. `Read-only file system`

A filesystem that was mounted read-write and is suddenly read-only almost always means the kernel remounted it ro after detecting I/O errors. Check dmesg — you may be looking at a failing disk, not a permissions issue.

10. `Multi-Attach error for volume`

A ReadWriteOnce volume can attach to exactly one node at a time — not one pod, one node. If a node goes NotReady with the volume still attached, a pod rescheduled elsewhere gets this error. Kubernetes waits ~6 minutes before force-detaching on purpose — to protect your data from being written by two hosts at once.

11. `502 Bad Gateway`

502 means your proxy tried the upstream and got nothing usable back: the connection was refused, the upstream died mid-request, or it sent an invalid response. It's rarely the proxy. connect() failed (111: Connection refused) in the NGINX error log means your app isn't listening where the proxy thinks it is.

12. `exec format error`

You built an image for one CPU architecture and ran it on another (hello, Apple Silicon → x86 clusters). Build multi-arch, or match your --platform.

The pattern

Every one of these has the same shape: the error message describes the symptom the system noticed, not the cause you need to fix. CrashLoopBackOff isn't why your pod is dying. OOMKilled isn't the node. The skill isn't memorizing fixes — it's knowing which single command turns the symptom back into a cause.

I keep a full, searchable library of these — every error above has a complete guide with the diagnostic workflow, an example root-cause analysis, and the prevention checklist. If you want the deeper version of any of them:

The Kubernetes ones live in the Kubernetes troubleshooting toolkit
Everything else is in the full error-guide library (Linux, Postgres, Docker, NGINX, and more)

What's the error that pages your team most? Curious whether it's on this list or something I should go write up next.

I Built Free Browser-Based Validators for YAML, Kubernetes and Terraform (No Upload, No Signup)

James Joyner — Tue, 14 Jul 2026 15:32:55 +0000

Every DevOps engineer has done this dance: you've got a chunk of YAML or a Terraform file that looks right, something's rejecting it, and you want a fast sanity check. So you paste it into some random online validator — and a small voice asks, wait, where did that config just go?

That config often has structure, comments, sometimes internal hostnames or resource names in it. Pasting infrastructure definitions into an unknown server is a habit worth breaking. So I built a set of validators that never send your config anywhere — they run entirely in your browser.

What they are

Free, browser-based validators for the formats DevOps folks paste-and-pray most:

YAML — catches the indentation and structure errors that make Kubernetes and CI configs fail with cryptic messages
Kubernetes manifests — schema-aware checks beyond "is it valid YAML," so you catch the wrong apiVersion or a misplaced field before kubectl apply does
Terraform / HCL — structural validation for the syntax slips that terraform validate flags only after you've context-switched away

The one design decision that matters

100% client-side. No upload, no signup, no server round-trip. Your config is parsed by JavaScript running in your own tab — it never leaves your machine. You can literally open dev-tools, watch the network panel, and see nothing go out. Turn off your wifi and they still work.

This isn't a privacy gimmick — it's the correct architecture for a tool that handles infrastructure definitions. A validator has no business seeing your config on a server it doesn't need to.

Why I bother

Two reasons, honestly.

One: I kept wanting this exact thing and kept not trusting the options. The nth time I hesitated before pasting a manifest into a stranger's website, I decided to just build the version I'd trust.

Two: fast feedback loops are the whole game in this job. The gap between "save the file" and "find out it's malformed" is pure friction — and the tighter that loop, the less of your working memory it burns. A validator that's one tab away and gives an answer in milliseconds is a small thing that compounds.

Try them

The validator workbench — YAML, Kubernetes, and Terraform, all client-side

If you're the kind of person who'd rather script it, a lot of the underlying tooling is open source — CLIs and a small read-only API for the prompt and error-guide data — over on the developer page and the GitHub org.

Client-side tools have real limits — they can't know your cluster's live state, and schema validation isn't the same as a policy check. But for the "did I just fat-finger the indentation" question, having the answer without a network request is exactly the trade I want.

What config format do you most wish had a trustworthy, offline, no-signup validator? That's genuinely how I decide what to build next.

It Works on My Machine: A Docker War Story About exec format error

James Joyner — Fri, 10 Jul 2026 14:45:41 +0000

"It works on my machine" is the oldest joke in software, and containers were supposed to kill it. Same image everywhere, same behavior everywhere — that's the whole pitch. So there's a special kind of betrayal when a container that runs perfectly on your laptop lands in the cluster and dies instantly with four unhelpful words:

exec /app/server: exec format error

Here's the afternoon that error cost me, and the thing it turned out to be teaching.

The setup

Built the image locally on a shiny new laptop. Ran it locally — perfect. Pushed it, the deploy rolled out, and every pod went straight into CrashLoopBackOff. kubectl logs showed the line above and nothing else. No stack trace, no panic, no hint. The binary that ran fine thirty seconds ago on my machine refused to execute at all in prod.

The maddening part, same as it always is: the exact same image. That's the container promise. How can the same bytes run in one place and be unrunnable in another?

The tell I walked right past

exec format error is the kernel's way of saying "I tried to execute this file and I don't recognize the format." Not "permission denied," not "not found" — I literally cannot run this shape of binary.

And the shape of a binary that a kernel can or can't run is its CPU architecture. My shiny new laptop was Apple Silicon — arm64. The cluster nodes were amd64. I'd built an arm64 binary, wrapped it in an image, and shipped it to machines that speak a different instruction set. Locally it ran because I was running it on the architecture I built it for. The moment it hit an amd64 node, the kernel looked at my arm64 executable and said, correctly, "I don't know how to run this."

Nothing was broken. Docker did exactly what I asked — it built an image for the platform I was on and faithfully shipped it. I just never told it that "the platform I'm on" and "the platform this runs on" were different.

Confirming it

Two commands make it obvious:

# what architecture is this image built for?
docker image inspect myimage:tag --format '{{.Architecture}}'
# arm64   ← there's the problem

# what do the target nodes run?
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
# amd64 amd64 amd64

arm64 image, amd64 nodes. Mystery over.
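
You can also ask the binary itself. The architecture is recorded in the ELF header, and the Python sketch below reads it. The two e_machine values come from the ELF specification; the function names are made up, and only the two architectures in this story are mapped.

```python
import struct

# e_machine values from the ELF specification for the two architectures in this story.
ELF_MACHINES = {0x3E: "amd64", 0xB7: "arm64"}


def elf_architecture(header: bytes) -> str:
    """Return the CPU architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF binary")
    # Byte 5 (EI_DATA) gives endianness: 1 = little, 2 = big.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown (0x{machine:x})")


def binary_architecture(path: str) -> str:
    """Read the ELF header of the file at `path` and report its architecture."""
    with open(path, "rb") as f:
        return elf_architecture(f.read(20))
```

Running binary_architecture on the server binary copied out of the image is the same check the kernel effectively fails when it prints exec format error.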

The fix

Stop building for "wherever I happen to be" and start building for where it runs. docker buildx builds multi-arch images from a single command:

docker buildx build --platform linux/amd64,linux/arm64 \
  -t registry/myimage:tag --push .

Now the registry holds both architectures under one tag, and every node pulls the variant it can actually run. If you only ever deploy to amd64, you can just pin that: --platform linux/amd64. Either way, the key is that the build platform is now a decision, not an accident of what laptop you bought.
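
Conceptually, "both architectures under one tag" means the tag points at an image index, and the runtime picks the entry matching the node. A simplified Python model of that selection is below. The field names follow the OCI image index layout, but the digests are placeholders and real runtimes also consider fields such as variant.

```python
def pick_manifest(index: dict, os_name: str, architecture: str) -> str:
    """Return the digest of the image variant matching the node's OS and CPU architecture."""
    for entry in index["manifests"]:
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == architecture:
            return entry["digest"]
    raise LookupError(f"no image variant for {os_name}/{architecture}")


# Hypothetical index for registry/myimage:tag; the digests are placeholders, not real values.
EXAMPLE_INDEX = {
    "manifests": [
        {"digest": "sha256:aaaa", "platform": {"os": "linux", "architecture": "amd64"}},
        {"digest": "sha256:bbbb", "platform": {"os": "linux", "architecture": "arm64"}},
    ]
}

if __name__ == "__main__":
    print(pick_manifest(EXAMPLE_INDEX, "linux", "amd64"))  # sha256:aaaa
```

A single-platform image has no such choice to offer, which is how an arm64-only tag ends up on an amd64 node.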

What it was actually teaching

The container promise isn't "the same image runs everywhere." It's "the same image runs everywhere that shares the contract it was built against" — and CPU architecture is part of that contract, an invisible part that used to be uniform and quietly stopped being uniform the day ARM laptops got good.

That's the pattern behind almost every "works on my machine" that survives containerization: some assumption from your environment rode along inside the image without you noticing — an architecture, a mounted file that only exists locally, an env var your shell sets and prod doesn't. The container didn't lie. It faithfully packaged your assumptions and carried them somewhere the assumptions weren't true.

The fix is always the same discipline: make the invisible contract explicit. Build for the target, not the desk you're sitting at.

I keep the full library of Docker gotchas like this one — the diagnostic commands, the root cause, the prevention — for the next time one of them eats an afternoon:

The Docker troubleshooting toolkit, and the executable file not found in $PATH guide for its close cousin (the other "your binary won't run" error).

What's your favorite "same image, different result" story? The ARM-laptop-to-x86-cluster one has bitten a lot of people since about 2021 — I doubt I'm the last.

The 10 Docker Errors That Waste the Most Time (and the One-Line Fix)

James Joyner — Fri, 10 Jul 2026 01:54:11 +0000

Docker is fantastic right up until it throws one of its greasy, context-free error messages at you and you lose twenty minutes to a thing that has a one-line fix. I've been collecting these — the exact strings, and the first thing to check for each. Here are the ten that eat the most time.

1. `Cannot connect to the Docker daemon at unix:///var/run/docker.sock`

The engine isn't reachable. In order of likelihood: the daemon isn't running (systemctl status docker), you're not in the docker group (sudo usermod -aG docker $USER, then log out and back in), or you're pointing at the wrong DOCKER_HOST. It's almost never Docker being broken — it's Docker not being up or you not being allowed.

2. `no space left on device`

Docker hoards. Dangling images, stopped containers, unused volumes and build cache pile up on the Docker root disk. docker system df shows you where it went; docker system prune -a --volumes reclaims it (read what it'll delete first). If df -h says you have space but Docker disagrees, you may be out of inodes.

3. `Bind for 0.0.0.0:8080 failed: port is already allocated`

Something already owns that port — often a container you forgot was running. docker ps to find it, or ss -tlnp | grep 8080 for a non-Docker process. Stop the holder or map to a different host port.

4. `pull access denied ... repository does not exist or may require 'docker login'`

Three flavors: the image name/tag is wrong, it's a private registry and you're not authenticated (docker login), or you've hit Docker Hub's anonymous pull rate limit. The error says "does not exist OR requires login" for a reason — check both.

5. `exec format error`

You built the image for one CPU architecture and ran it on another — the classic Apple Silicon (arm64) build landing on an amd64 server. Build multi-arch with docker buildx, or pin --platform to match your target.

6. `OCI runtime create failed`

A low-level container-start failure. The useful part is always after the colon — a missing binary, a bad mount, a permissions problem. Read the full message; OCI runtime create failed itself tells you nothing.

7. `executable file not found in $PATH`

Your CMD or ENTRYPOINT points at a binary the image doesn't have, often because a slim/distroless base doesn't ship a shell, or you assumed a tool was installed. Check exec form vs shell form (shell form runs through /bin/sh -c, so it needs a shell in the image; exec form runs the binary directly) and confirm the binary actually exists in the final layer.

8. `TLS handshake timeout`

Usually not a cert problem — it's a network path issue (a proxy, MTU, or firewall) between you and the registry, masquerading as TLS. Test raw connectivity before you touch certificates.

9. `failed to compute cache key: ... not found` (COPY/ADD)

Your Dockerfile is trying to COPY a file that isn't in the build context — either the path is wrong, or .dockerignore is excluding it. Remember paths are relative to the context root, not the Dockerfile.

10. `Conflict. The container name "/x" is already in use`

A container with that name already exists (running or stopped). docker rm x removes a stopped one (docker rm -f x if it is still running), or use --rm / a fresh name. Common in CI where a previous run didn't clean up.

The pattern

Nearly every Docker error puts the useful information after the colon and a generic category before it. OCI runtime create failed is the category; the cause is the clause you skimmed past. Train yourself to read to the end of the line before you start googling.

I keep complete guides for all of these — and about eighty more Docker errors — each with the diagnostic workflow, a worked root-cause example, and the prevention checklist:

The Docker troubleshooting toolkit — the top errors, launcher, and runbooks in one place

Which Docker error has personally cost you the most hours? Genuinely curious which of these tops the list for other people.

How I Cut a Docker Image From 1.2GB to 180MB

James Joyner — Thu, 09 Jul 2026 02:32:56 +0000

A while back I inherited a service whose Docker image was 1.2GB. Pulls were slow, the CI cache was useless, and the deploy step took long enough that people context-switched away and forgot about it. I got it down to about 180MB without changing a line of application code. Here's exactly what moved the needle, roughly in order of impact.

1. Multi-stage builds (the big one)

The single biggest win. The original Dockerfile built the app and shipped the entire build toolchain along with it — compilers, dev headers, the full package cache. None of that is needed at runtime.

# build stage — has all the heavy tooling
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# runtime stage — starts clean, copies only the artifact
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]

The runtime image never contains the build tools. That alone took roughly 500MB off.

2. Pick a smaller base

node:20 is Debian with everything. node:20-slim drops a couple hundred MB. If your app is a static binary (Go, Rust) you can go all the way to distroless or scratch and ship just the binary — no shell, no package manager, no OS to speak of. Smaller base = smaller image and a smaller attack surface, which your security team will also thank you for.

The trade-off: distroless has no shell, so docker exec ... sh won't work for debugging. Know that going in.

3. Order layers by how often they change

Docker caches layers top-down and invalidates everything after the first change. If you COPY . . before installing dependencies, every code change busts your dependency cache and reinstalls everything.

Copy your lockfile and install deps first, then copy the rest of the source:

COPY package*.json ./
RUN npm ci          # cached until dependencies actually change
COPY . .            # changes every commit, but deps stay cached

This didn't shrink the final image much, but it turned a 4-minute rebuild into a 20-second one.
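
The invalidation rule is easy to model. The Python sketch below is a toy, not how BuildKit is implemented: it treats a Dockerfile as a list of instructions and returns everything from the first changed one onward.

```python
def layers_to_rebuild(instructions: list[str], changed: set[str]) -> list[str]:
    """Return the instructions that must re-run, given which ones saw changed inputs.

    Toy model of the rule described above: the first cache miss invalidates
    every instruction after it.
    """
    for i, instruction in enumerate(instructions):
        if instruction in changed:
            return instructions[i:]
    return []


BAD_ORDER = ["COPY . .", "RUN npm ci"]
GOOD_ORDER = ["COPY package*.json ./", "RUN npm ci", "COPY . ."]

if __name__ == "__main__":
    # A source-only commit changes the inputs of "COPY . ." and nothing else.
    print(layers_to_rebuild(BAD_ORDER, {"COPY . ."}))   # ['COPY . .', 'RUN npm ci']
    print(layers_to_rebuild(GOOD_ORDER, {"COPY . ."}))  # ['COPY . .']
```

In the bad order a one-line code change drags npm ci along with it. In the good order only the final copy re-runs.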

4. Add a real `.dockerignore`

Without it, COPY . . drags your entire .git history, node_modules, local .env files, test fixtures, and CI logs into the build context — bloating the image and leaking things you don't want baked into a layer.

.git
node_modules
*.log
.env*
dist
coverage

5. Collapse and clean up `RUN` layers

Every RUN is a layer, and deleting files in a later layer doesn't shrink the earlier one. Install, use, and clean up in a single RUN:

RUN apt-get update \
 && apt-get install -y --no-install-recommends some-tool \
 && rm -rf /var/lib/apt/lists/*

The rm has to be in the same RUN as the apt-get, or the cache still ships in the layer beneath it.

The results

Metric	Before	After
Image size	1.2 GB	~180 MB
Cold pull	~90s	~12s
Cached rebuild	~4 min	~20s

None of this is exotic — it's multi-stage, a slimmer base, layer order, .dockerignore, and cleaning up in place. But together they turn a deploy you dread into one you don't think about.

If you want the deeper reference — including the Docker errors these optimizations sometimes surface (no space left on device, cache-key failures, and friends) — I keep a full set:

The Docker troubleshooting toolkit and the no space left on device guide for when the build disk fills up mid-optimization

What's the smallest you've gotten a real production image (not a hello-world)? Always looking for tricks I haven't tried.

7 Dockerfile Mistakes That Are Quietly Costing You

James Joyner — Wed, 08 Jul 2026 15:55:42 +0000

Most Dockerfiles work. That's the problem — "it builds and runs" hides a lot of quiet costs in security, speed, and size that don't announce themselves until an audit, an incident, or a cloud bill does it for them. Here are seven mistakes I see constantly, and what to do instead.

1. Running as root

By default, the process in your container runs as root, and if someone escapes the container they start out as root on the host too (unless user namespaces are remapped). Add a non-root user and switch to it:

RUN useradd --system --uid 10001 appuser
USER appuser

Cheap, and it closes off a whole category of "well, at least it wasn't root" incidents.

2. `FROM some-image:latest`

latest is not a version — it's "whatever was newest when this happened to build." Two builds a week apart can produce different images with no diff to explain it, and a surprise base upgrade is a fun way to spend a Friday. Pin a specific tag, ideally by digest:

FROM node:20.11.1-slim

3. Baking secrets into layers

COPY .env . or ARG API_KEY followed by using it — and now the secret lives in an image layer forever, recoverable by anyone who pulls the image, even if a later layer deletes the file. Layers are immutable and additive; you can't delete your way out of a leak. Use build secrets (--mount=type=secret) or inject at runtime, never at build.

4. No `.dockerignore`

Without one, COPY . . sweeps your .git directory, local env files, node_modules, and test data into the build context — bloating the image and, worse, potentially baking credentials and history into a layer. A five-line .dockerignore is one of the highest-leverage files in the repo.

5. Layer order that destroys your cache

COPY . .
RUN npm ci      # ← reinstalls on EVERY code change

Docker invalidates every layer after the first change. Copy the lockfile and install dependencies before copying the rest of your source, so a one-line code change doesn't trigger a full reinstall. This is a build-speed bug hiding as a style choice.

6. Leaving package manager cruft in the image

RUN apt-get update && apt-get install -y curl

That leaves the apt lists sitting in the layer. And cleaning them in a separate RUN doesn't help — the bytes are already committed to the earlier layer. Do it all in one:

RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*

7. No `HEALTHCHECK`

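Without one, Docker only knows whether the process is alive, not whether the app can actually serve. A container can be "up" and completely wedged. Note that Kubernetes ignores the Dockerfile HEALTHCHECK and relies on its own liveness and readiness probes, and plain docker run only reports the status. Under Docker, a healthcheck that hits a real endpoint at least lets the platform notice a wedged container (Swarm will also replace it):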

HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8080/healthz || exit 1

(Make sure that endpoint checks something real, not just "is the web server running" — but that's a whole other article.)
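
As a rough idea of what "checks something real" could mean, here is a framework-free Python sketch of the logic behind such an endpoint. The check_dependency callable is hypothetical: in a real service it might run SELECT 1 through the app's own connection pool.

```python
from typing import Callable


def health_response(check_dependency: Callable[[], None]) -> tuple[int, str]:
    """Return (status, body) for a health endpoint that exercises a real dependency.

    `check_dependency` is a hypothetical callable that should do cheap real work
    and raise an exception on failure.
    """
    try:
        check_dependency()
    except Exception as exc:  # any failure means "do not send me traffic"
        return 503, f"unhealthy: {exc}"
    return 200, "ok"
```

Because curl -f exits non-zero on a 503, wiring this status into /healthz is enough for the HEALTHCHECK above to flip the container to unhealthy.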

The theme

None of these break the build. They surface later — as a security finding, a slow pipeline, a bloated registry, or a container that's "healthy" while failing. The fixes are all a line or two; the hard part is remembering them at 4 p.m. on a Dockerfile you just want to ship.

Two things that help me not forget:

I run Dockerfiles and Compose files through a client-side validator (runs in your browser, nothing uploaded) to catch the structural stuff before it ships.
And I keep a Docker troubleshooting toolkit for when one of these mistakes graduates into an actual error at runtime.

Which of these did you learn the hard way? Mine was #3, and I think about it more than I'd like to admit.

Reading Loki Logs With AI: Patterns That Work

James Joyner — Tue, 07 Jul 2026 15:43:17 +0000

If you've adopted Loki for log aggregation, you've probably had this moment: you need to find something in your logs right now, you open Grafana, and you stare at the empty LogQL query bar trying to remember whether it's |= or =~ for the substring filter. Five minutes later you've cobbled something together, run it, gotten zero results, and you're not sure if the query is wrong or the logs aren't there.

This is the kind of friction AI is good at removing. LogQL has a small, structured grammar; the model knows it; you describe what you want; you get a working query. But — and this is the recurring theme — the model will also sometimes produce queries that are syntactically valid and semantically wrong, and there's a specific way to catch that.

The basics AI handles well

These are the patterns I use AI for daily without much verification:

Label filter + substring:

Give me a LogQL query that finds error messages from the payments app in the production namespace over the last hour.

Reliable output:

{namespace="production", app="payments"} |= "error"

Counting by level:

Count the rate of log entries per level for the web app over 5-minute windows.

Reliable output:

sum by (level)(rate({app="web"} | json | __error__="" [5m]))

JSON parsing + extraction:

Show me request durations from the api app where the JSON duration_ms field is greater than 1000.

Reliable output:

{app="api"} | json | duration_ms > 1000

These are exactly the kinds of queries that take me 5 minutes to write from memory and 5 seconds to get from Claude. Worth it.

Where AI gets LogQL wrong

Confusing PromQL and LogQL syntax

The model has read more PromQL than LogQL, and sometimes it leaks. You'll get a query like:

rate({app="web"}[5m]) by (level)

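That by (level) is dangling: a grouping clause has to hang off an aggregation operator such as sum, and rate() on its own is not one (the same holds in PromQL). In LogQL, you need: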

sum by (level)(rate({app="web"}[5m]))

The Grafana editor catches this and tells you it's a syntax error. But if you're using logcli or the API directly, you might get a confusing error and not realize the issue is structural.

Using labels that aren't indexed

LogQL is fast when you filter on indexed labels (the ones in {}). It's slow when you filter on extracted fields after | json. The model doesn't know which of your labels are indexed; it'll happily put high-cardinality fields in the curly braces.

If you ask:

Find all requests from user 12345.

You might get:

{app="api", user_id="12345"}

If user_id is not a stream label (and it usually shouldn't be, since it's high cardinality), this query still parses and runs, but the selector matches no streams and you get an empty result that looks like "no logs." The correct query is:

{app="api"} | json | user_id="12345"

When you describe the query, tell the model which labels are in the stream selector vs which are JSON fields. Otherwise it guesses.
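
One way to keep that distinction honest is to encode it. The Python sketch below is a hypothetical helper, not part of any Loki client: it builds a query from two separate dictionaries, so a field can only land in the stream selector if you deliberately put it there. It does no escaping of values.

```python
def build_logql(stream_labels: dict[str, str], json_fields: dict[str, str]) -> str:
    """Build a LogQL query with stream labels in the selector and JSON fields after `| json`."""
    selector = ", ".join(f'{key}="{value}"' for key, value in stream_labels.items())
    query = "{" + selector + "}"
    if json_fields:
        filters = " | ".join(f'{key}="{value}"' for key, value in json_fields.items())
        query += " | json | " + filters
    return query


if __name__ == "__main__":
    print(build_logql({"app": "api"}, {"user_id": "12345"}))
    # {app="api"} | json | user_id="12345"
```

The same two lists (stream labels, extracted fields) are what you would paste into the prompt.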

Inventing operators

LogQL has |=, !=, |~, !~ for line filtering. The model sometimes invents contains, like, or other operators that don't exist. The query fails with a parse error.

Easy to catch — the parse error tells you exactly which token is wrong — but worth knowing as a class of failure.

Confused unwrap behavior

For metric queries over log values (rate of a histogram, sum of a counter), you need | unwrap. The model sometimes uses | unwrap incorrectly or skips it when needed. The query runs but returns 0 or NaN, which looks like "no data" but is really "wrong aggregation."

This one is harder to catch because the query executes. You have to read the result and notice it's wrong.

A workflow for unfamiliar log shapes

When you're investigating logs you don't normally look at — different team's service, vendor product, etc. — there's a specific sequence I use:

Step 1: Get a sample

{app="unfamiliar-service"} | json

Run this against a small time window. Grafana shows you the parsed fields. Now you know what's available.

Step 2: Show the AI the sample

Paste a couple of representative log lines (sanitized) into Claude with:

Here are 3 sample log lines from a service I don't usually monitor. They're JSON. Tell me what each field appears to mean and what would be good labels to filter on.

The model reads the structure and tells you which fields are useful. This takes 30 seconds and gives you a mental model of the log shape.

Step 3: Generate the query

Now you can ask for a specific query with confidence that the field names you give the model are real:

Generate a LogQL query that filters the unfamiliar-service app for entries where status_code is 500 or 503 and duration_ms is over 200.

The result will use the correct field names because you told the model what they are.

Step 4: Verify before alerting

If the query is going into an alert, run it against historical data and check the results. The model doesn't know your baseline. A query that returns "no data" right now might return huge volumes during a real incident, or vice versa.

A trap worth flagging

Loki's query frontend caches results aggressively. If you're iterating on a query and the AI changed something subtle, you might get cached results from the previous query and think your change didn't take effect.

Fix: When iterating, change the time range slightly between queries (or use instant queries). This bypasses the cache.

Logs as a debugging input for AI

A separate but related use case: pasting logs into the AI to get help debugging. This works better than I expected for logs that have clear structure (JSON, logfmt) and worse for noisy unstructured logs.

The trick is to paste a window around the suspected issue — not the whole log file. Five minutes around the incident is usually plenty. The model spots patterns ("you have 47 OOM kills in this window, all on pods in the payments namespace") that I'd miss manually.

But: keep the volume sane. A 5MB log paste degrades the model's attention. If you have 10,000 lines, filter to the relevant subset first.
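
Trimming to a window is simple to script before you paste. The Python sketch below assumes each line starts with an ISO 8601 timestamp followed by a space, and treats `minutes` as the half-width of the window. Both are assumptions, so adjust them to your log format.

```python
from datetime import datetime, timedelta


def window_around(lines: list[str], incident: datetime, minutes: int = 5) -> list[str]:
    """Keep log lines whose leading ISO 8601 timestamp is within +/- `minutes` of `incident`."""
    half = timedelta(minutes=minutes)
    kept = []
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a parseable timestamp
        if abs(ts - incident) <= half:
            kept.append(line)
    return kept
```

Timestamps and the incident time need to be either all timezone-aware or all naive, otherwise the subtraction raises a TypeError.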

The pattern that ties it together

Most of what I've described is the same shape: give the model a small amount of accurate context (sample logs, label names, time range), then let it generate the LogQL. The failures all come from skipping the context step and hoping the model can guess your schema.

For prompts on Loki specifically, see the Loki log aggregation design and the Grafana logs panel patterns. For PromQL, the related PromQL query optimization prompt covers similar territory.

This article was originally published on DevOps AI ToolKit — practical AI workflows for cloud engineers.

The Pod That Lied: A Kubernetes Readiness Probe War Story

James Joyner — Tue, 07 Jul 2026 07:33:12 +0000

Last time I told you to bring coffee. I did. This one didn't start at 2 a.m. — it started at 10:40 on a perfectly ordinary Tuesday, which, if anything, is worse. There's a particular kind of dread that only arrives in full daylight, when you're well-rested and caffeinated and therefore have no excuse for not understanding what's happening. And I did not understand what was happening.

Welcome back to Troubleshooting Kubernetes. Today's episode is about the most unsettling category of outage there is: the one where nothing is broken. Every light is green. Every pod is Running. Every dashboard is calm. And your users are getting 500s.

I've been doing this long enough — a couple of decades and change, across a few places I've been lucky to work — to have developed a healthy distrust of green. Green is not a fact. Green is a claim. And on that Tuesday, something was lying to me with a very straight face.

The symptom that wasn't there

The error rate on the checkout API had crept from basically-zero to about 18% over twenty minutes. Not a spike. A creep — the kind that makes you wonder if it's real or if the graph is just having feelings. So, first move, always: is this real, or is the monitoring lying?

# hit it through the front door, a bunch of times
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" https://checkout.internal/api/health-of-a-real-endpoint
done

Real. About one in five came back 500. And "about one in five" is a clue, not just a number — an intermittent failure spread across requests usually means it's spread across replicas. Some of the pods behind the Service were fine. Some were not. The load balancer was cheerfully dealing everyone a random card from a deck that had a few jokers in it.

So I went to look at the pods, fully expecting to find a couple of them crash-looping or NotReady or otherwise waving their hands.

kubectl get pods -n checkout -o wide

Five replicas. Running. 1/1. Ready. Every last one of them, calm as a millpond. Zero restarts. According to Kubernetes, this service was in perfect health. According to the customers, it was on fire. Those two things cannot both be true, and yet there they were, both being true, mocking me over my second coffee.

Finding the jokers

If the Service is routing to five pods and some are bad, I want to know which ones — go around the load balancer and interrogate each pod directly. The endpoints list tells you who's actually in rotation:

kubectl get endpoints checkout-api -n checkout -o wide

All five pod IPs, present and accounted for, all listed as ready targets. Fine. Let's talk to them one at a time.

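# talk to a single pod, bypassing the Service
# (port-forward blocks, so background it or run it in a second terminal)
kubectl port-forward pod/checkout-api-6c8b9-x2k7p 8080:8080 -n checkout &
# -s alone prints only the body; -w prints the status code we care about
curl -s -o /dev/null -w "%{http_code}\n" localhost:8080/api/checkout/quote -d '{...}'  # returns 500
curl -s -o /dev/null -w "%{http_code}\n" localhost:8080/healthz                        # returns 200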

There it was. On the bad replicas, the real endpoint threw a 500, and the health endpoint returned a serene, confident 200 OK. The pod was, in the most literal sense, telling Kubernetes it was ready to serve — while being completely unable to serve.

Two of the five were doing this. And Kubernetes, being a faithful and literal machine, kept both of them in the endpoints list and kept dealing them to real customers, because they had passed the only test it knew to run.
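
If you'd rather not port-forward to each replica by hand, the same comparison can be scripted. What follows is a rough sketch, not what I ran that Tuesday: find_lying_pods, http_status, and the pod URLs are names I've made up for illustration, and it assumes the real endpoint can be exercised with a plain GET (the quote endpoint above takes a request body, so you'd adapt the call).

```python
from typing import Callable, Dict, List
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status of a GET, or 0 if the pod cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; the code is still the answer
    except (URLError, OSError):
        return 0


def find_lying_pods(
    pod_base_urls: Dict[str, str],
    health_path: str = "/healthz",
    real_path: str = "/api/checkout/quote",  # hypothetical: assumes a GET works here
    get_status: Callable[[str], int] = http_status,
) -> List[str]:
    """Names of pods whose health check passes while the real endpoint fails."""
    liars = []
    for pod, base in sorted(pod_base_urls.items()):
        healthy = get_status(base + health_path) == 200
        serving = 200 <= get_status(base + real_path) < 300
        if healthy and not serving:
            liars.append(pod)
    return liars
```

Feed it a mapping of pod name to base URL (pod IPs from the endpoints list, or one port-forward per pod) and it hands back only the replicas that say they are fine and are not.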

The lie, and who told it

Here's the readiness probe those pods were configured with — and I'd bet real money it looks familiar, because it's the one everybody writes on their first day and never revisits:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10

And here, roughly, is what /healthz did:

GET /healthz  →  "am I an HTTP server that is currently running?"  →  200

That's it. That's the whole check. It confirms the process is up and the socket is listening. It does not ask the one question that actually matters to a customer: can you do your job right now? It never touches the database. It never checks the connection pool. It is a smoke detector wired to confirm that it has electricity.

What had actually happened: the database had done a failover about twenty-five minutes earlier — routine, expected, mostly graceful. Most of the app replicas noticed, dropped their stale connections, and reconnected to the new primary. Two of them didn't. Their connection pools were wedged full of dead connections to an IP that no longer answered, so every real query timed out into a 500. But the HTTP server? Still up. Still listening. Still cheerfully answering /healthz with a 200.

So the pods weren't broken from Kubernetes' point of view. They were doing exactly what they'd promised. The lie wasn't Kubernetes'. Kubernetes was the most honest actor in the whole incident — it did precisely what it was told, with no imagination whatsoever. The lie was in the probe. We had told it that "the web server is up" meant "this pod can serve customers," and it had believed us, because why wouldn't it. The green dashboard was technically, uselessly accurate.

I have a lot of respect for that, honestly. It's the same trait that saved my data in the last installment — Kubernetes doing the literal thing, refusing to be clever. It'll protect you from yourself and it'll faithfully execute your mistakes with equal diligence. The machine isn't the problem. The machine is a mirror.

Getting the jokers out of the deck

Two things to do, in order: stop the bleeding, then fix the actual bug.

Stop the bleeding. The two wedged replicas just needed their connection pools reset, and the fastest way to reset a pod's everything is to let the Deployment give you a fresh one:

kubectl delete pod checkout-api-6c8b9-x2k7p checkout-api-6c8b9-9m4tz -n checkout

New pods came up, connected cleanly to the new primary, and the 500s stopped inside a minute. Error rate back to zero. Crisis over. And this is exactly the moment where a tired engineer declares victory, closes the incident, and goes to lunch having fixed nothing — because the bug wasn't the wedged pool. Connection pools wedge. Databases fail over. That's Tuesday. The bug was that a pod that couldn't serve was allowed to keep serving.

Fix the actual bug. The readiness probe has to check what actually matters. A real /readyz that returns 503 when the pool can't hand out a working connection:

readinessProbe:
  httpGet:
    path: /readyz          # checks a DB connection, not just the socket
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

Now a wedged replica reports itself not ready, Kubernetes pulls it from the Service endpoints automatically, and traffic routes only to pods that can actually do the work. The self-healing you thought you already had, you now actually have.
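
For the application side, here is a minimal sketch of what a /readyz like that could look like. It is not the checkout service's real handler: readiness, make_handler, and the check_db callable are placeholder names, and the sketch assumes your pool gives you some way to borrow a connection and run a trivial query that raises when the connection is dead.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Tuple


def readiness(check_db: Callable[[], None]) -> Tuple[int, str]:
    """Return (200, "ready") if the dependency check passes, (503, reason) if it raises."""
    try:
        check_db()  # assumed: borrows a pooled connection and runs something like SELECT 1
    except Exception as exc:
        return 503, f"not ready: {exc}"
    return 200, "ready"


def make_handler(check_db: Callable[[], None]):
    """Build a request handler class that serves /healthz and /readyz."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/readyz":
                status, body = readiness(check_db)
            elif self.path == "/healthz":
                status, body = 200, "ok"  # process-level check only, as before
            else:
                status, body = 404, "not found"
            payload = body.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return Handler
```

To try it locally, pass make_handler(your_check) to http.server.HTTPServer(("0.0.0.0", 8080), ...) and call serve_forever(). Keep the check cheap and give it its own short timeout, so a slow database shows up as a 503 and not as a hung probe.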

But — and this is the part people skip, the part that turns a fix into a second outage — do not get greedy with your readiness probe. If every replica shares one database, and you wire readiness directly to that database, then the next time the DB hiccups for fifteen seconds, all five pods simultaneously report not-ready, Kubernetes yanks every endpoint, and now you've converted a fifteen-second blip into a total, self-inflicted outage with a thundering-herd reconnect on the far side. The probe should reflect this pod's ability to serve, with enough tolerance (failureThreshold, sane timing) that a shared, transient dependency wobble doesn't get amplified into a coordinated group suicide. Readiness pulls a sick pod from rotation. It should not pull a nervous one.
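
To put rough numbers on "enough tolerance", here is a back-of-the-envelope model. It assumes every probe that lands inside the dependency blip fails, and it ignores timeoutSeconds and successThreshold, so treat it as a way to reason about settings and not as a description of exactly how the kubelet schedules probes. The function names are mine.

```python
def failed_probes_during_blip(
    blip_seconds: float, period_seconds: float, first_probe_at: float = 0.0
) -> int:
    """Count readiness probes that land inside an outage spanning [0, blip_seconds)."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    count = 0
    t = first_probe_at
    while t < blip_seconds:
        count += 1
        t += period_seconds
    return count


def blip_pulls_pod(
    blip_seconds: float,
    period_seconds: float,
    failure_threshold: int,
    first_probe_at: float = 0.0,
) -> bool:
    """True if the blip produces enough consecutive failures to mark the pod not ready."""
    failures = failed_probes_during_blip(blip_seconds, period_seconds, first_probe_at)
    return failures >= failure_threshold


# Worst case (a probe fires the instant the blip starts):
print(blip_pulls_pod(15, period_seconds=5, failure_threshold=3))   # 3 failed probes
print(blip_pulls_pod(15, period_seconds=10, failure_threshold=3))  # 2 failed probes
```

Under this model, periodSeconds: 5 with failureThreshold: 3 can still be tripped by a fifteen-second wobble on a shared dependency, while a longer period or a higher threshold rides it out. The price of that tolerance is that a replica that really is wedged stays in rotation a little longer.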

And keep liveness and readiness doing different jobs. Liveness answers "should I be restarted?" — reserve it for genuinely wedged, unrecoverable states, because a liveness probe tied to a shared dependency is how you turn a database blip into a cluster-wide restart storm. Readiness answers "should I get traffic?" Conflate them and you'll eventually get both failure modes at once, which is a bad day I'll tell you about some other time.

Why green lies, and why I still love this

The lesson underneath the lesson: a health check is a promise your application makes about itself, and your monitoring is only ever as honest as that promise. The dashboard wasn't wrong. It was faithfully reporting a claim that happened to be worthless. If you take one thing from this: write the probe that checks the thing your users actually depend on, not the thing that's easy to check. The gap between those two is where 10:40-on-a-Tuesday lives.

That gap — the difference between "the process is up" and "the thing actually works" — is basically the whole reason I started keeping structured notes on this stuff, and eventually turned them into a site. When you're staring at a green dashboard and a red reality, the useful thing isn't a metrics graph, it's a hypothesis: "some replicas can't reach a dependency; check the endpoints and probe each pod directly." I'll sometimes paste the raw logs into the AI incident assistant I keep on the site just to get to that first hypothesis faster — not because it knows the answer, but because at minute three of an incident, a decent starting question is worth more than another dashboard. Sanitize your secrets before you paste, obviously. The judgment is still yours; the tool just hands you the thread to pull.

Here's the thing I keep coming back to, twenty-odd years in: this incident was beautiful, in the specific way that only a good bug is beautiful. Nothing was broken. Every component behaved exactly as designed. And the emergent result was still wrong, because of a single lazy assumption baked into six lines of YAML that someone — probably me, honestly — wrote in a hurry two years ago and never looked at again. Untangling that, watching the green dashboard finally mean something, is a feeling I have never once gotten tired of. The pages are annoying. The 10:40 Tuesdays are worse. And I would not trade this job for a quieter one, because a quieter one wouldn't let me do that.

Next time in Troubleshooting Kubernetes: the DNS lookup that resolved perfectly from my laptop, resolved perfectly from the node, and failed only — only — from inside the pod. Bring coffee. Bring patience. Bring a copy of the ndots documentation you've never actually read.

The running set of Kubernetes runbooks, probe patterns, and error guides I keep so I'm not re-deriving them mid-incident lives on the Kubernetes toolkit. Green is a claim. Make your probes tell the truth.

This article was originally published on DevOps AI ToolKit — practical AI workflows for cloud engineers.

The Volume That Wouldn't Let Go: A Kubernetes PVC War Story

James Joyner — Tue, 07 Jul 2026 07:33:08 +0000

The page came in at 2:14 a.m., which is the only time pages ever come in. I have a theory that PagerDuty holds them in a little queue until it's certain you're in the deepest part of your sleep cycle, and then it releases them all at once, like a cat knocking a glass off a table while making eye contact. I've been doing this work for twenty-five years — long enough to have carried a literal pager at Yahoo, long enough to have learned resilience the hard way at Netflix, long enough that a 2 a.m. alert no longer produces adrenaline so much as a kind of tired affection. Here we go again. Let's see what you've got.

What it had was a payments-adjacent service that had stopped serving. Not slow. Not flapping. Stopped. The kind of outage where the graph doesn't decline, it just falls off the edge of the world.

This is the first in a series I'm calling Troubleshooting Kubernetes — war stories from the cluster, with the actual fixes, told by someone who genuinely loves this work and is also deeply, professionally tired. Those two things aren't in tension. They never have been.

The symptom, and the small lie it told

First move, always the cheapest one: look at the pods.

kubectl get pods -n payments -o wide

Two of three replicas were Running. The third — the one that had presumably rescheduled after something went sideways — was sitting in ContainerCreating, and had been for six minutes. Six minutes is an interesting number in Kubernetes. Hold onto it.

ContainerCreating is Kubernetes' way of telling you it's trying, which is the most maddening status there is. It's not an error. It's not a crash loop you can grep for. It's a shrug. So you ask it to be specific:

kubectl describe pod payments-api-7d9f5-abcde -n payments

And there, down in the Events, was the sentence that would define my next twenty minutes:

Warning  FailedAttachVolume   attachdetach-controller
  Multi-Attach error for volume "pvc-4a1b...": Volume is already
  exclusively attached to one node and can't be attached to another

Now, if you've never met this error, it reads like an accusation. Multi-Attach. It sounds like I did something greedy — like I tried to bolt the same disk onto two machines at once out of hubris. I did not. Kubernetes did, on my behalf, and then caught itself, and now it's blaming me. Classic.

Here's the honest part: my very first instinct was wrong. My tired brain went, "storage backend's having a bad night, page the storage team, go back to bed." That's the instinct talking, not the evidence. So I did the thing I always tell junior engineers to do and almost didn't do myself at 2 a.m.: I stopped guessing and started proving.

The mental model that actually matters

A ReadWriteOnce PersistentVolume (RWO, the access mode most people pick without thinking until it bites them) can be attached to exactly one node at a time. Not one pod. One node. That distinction is the whole story. The volume was fine. The data was fine. The problem was that the volume was still, according to Kubernetes' bookkeeping, attached to a node the new pod wasn't running on.

Which node? Let's ask.

kubectl get volumeattachment | grep pvc-4a1b
kubectl get nodes

The VolumeAttachment object still pointed at node-17. And node-17, per kubectl get nodes, was sitting there in NotReady, looking innocent.

Not gone. Not deleted. NotReady.

And that, right there, is where the error stops being a bug and starts being Kubernetes doing exactly what I'd want it to do, if I weren't so annoyed at it. Because here's the thing the attach-detach controller knows that I, at 2 a.m., had briefly forgotten: a NotReady node is not a dead node. It's an unreachable node. Those are wildly different situations, and the difference is measured in whether or not you corrupt a filesystem.

If node-17 had genuinely died — power gone, instance terminated — then sure, force-detach the volume and let the new pod have it. But if node-17 is merely unreachable — a network partition, a wedged kubelet, a security group someone "cleaned up" — then the old pod might still be alive on that node, still writing to that disk. Rip the volume away and hand it to a second writer, and you've just mounted the same block device on two hosts at once. On a filesystem that assumes it's the only one home. That's not an outage anymore. That's a data-corruption incident with a much longer postmortem and a much worse tone.

So Kubernetes waits. By default, roughly six minutes — remember the six minutes? — before the controller-manager will force-detach a volume from an unreachable node. It's not being slow. It's being careful. It would rather your service be down for six minutes than your data be wrong forever. Given the choice, so would I. Given the choice at 2 a.m., I had to remind myself I agreed.
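
The "which node still holds this volume, and is that node Ready?" lookup can be scripted against the JSON form of the two commands above. This is a sketch under assumptions: it expects the parsed output of kubectl get volumeattachment -o json and kubectl get nodes -o json, and the function names are my own. It only reports. Whether the old writer is really dead is still a question you answer yourself.

```python
from typing import Dict, List


def node_is_ready(node: dict) -> bool:
    """True only when the node's Ready condition has status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def attachments_on_unready_nodes(
    volume_attachments: dict, nodes: dict
) -> List[Dict[str, str]]:
    """List VolumeAttachments whose node is NotReady or no longer exists."""
    ready = {n["metadata"]["name"]: node_is_ready(n) for n in nodes.get("items", [])}
    stuck = []
    for va in volume_attachments.get("items", []):
        node_name = va["spec"]["nodeName"]
        if not ready.get(node_name, False):
            stuck.append(
                {
                    "volume": va["spec"]["source"].get("persistentVolumeName", ""),
                    "node": node_name,
                    "node_state": "NotReady" if node_name in ready else "missing",
                }
            )
    return stuck
```

Load both documents with json.load and pass them in. An empty list means every attachment sits on a Ready node, so the Multi-Attach error has some other cause.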

The fix, which was mostly a decision

So the actual work wasn't a command. It was a question: is node-17 coming back, or is it gone?

I checked the cloud console. The instance was still there, still "running," just not talking to the API server. kubelet had wedged after a bad network event — the node was up, the workload on it was long dead (the container had exited), but the kubelet hadn't been able to report that fact home. So the volume was safe to move. No live writer. I just had to convince the control plane of what I'd already confirmed with my own eyes.

The clean way, when you've verified the node is truly not running your workload, is to let the controller do its job by removing the thing it's protecting:

# Only after CONFIRMING nothing is still writing to that volume.
kubectl delete node node-17

Deleting the Node object tells the attach-detach controller the truth it couldn't discover on its own: that node is out of the picture. The controller detached the volume, the VolumeAttachment for the old node disappeared, the pending pod's attach succeeded, the container started, and the graph climbed back out of the hole it had fallen into. Total time down: about nine minutes, most of which was me confirming I wasn't about to do something stupid.

There's a rougher tool — deleting the VolumeAttachment object directly to force the issue — and I want to be careful here, because I've seen people reach for it first, like a fire axe, when the situation called for a key. Do that against a node that's still writing and you own the corruption. The order matters: prove the writer is dead, then detach. Never the other way around. The whole reason the six-minute timer exists is to save you from your own impatience. I have learned to respect it, mostly by having once not respected it.

If you want the version of this I wish someone had handed me the first time — the exact diagnostic sequence, the "is the node dead or just unreachable" decision tree, the prevention checklist — I ended up writing it down as a proper FailedAttachVolume / Multi-Attach error guide. Which brings me to the part of this story that isn't about storage at all.

The epiphany, sitting on my kitchen floor

Here's what I don't usually admit in these. While I was waiting out that timer — because sometimes the correct engineering action is to wait, and drink water, and not touch anything — I was doing something a little pathetic. I was grepping my own notes. A directory called ~/notes/ full of markdown files with names like k8s-storage-stuff.md and THINGS-I-FORGET.md, going back years. Some of them referenced Slack threads that no longer exist. One of them referenced a Confluence page that 404s now, at a company I no longer work at, describing a fix I was, at that exact moment, re-deriving from scratch.

And it landed on me, sitting on the kitchen floor with a laptop and a glass of water at 2:30 in the morning: I had solved this before. Not this exact incident, but this shape of problem — the RWO volume stuck on an unreachable node, the six-minute wait, the is-it-dead-or-just-quiet decision. I'd solved it at Netflix. I'd probably solved a cousin of it at Yahoo. I would solve it again. And every single time, the hard part wasn't the fix. The fix is knowable. The hard part was reconstructing the fix, under pressure, from a scattered archaeology of dead links and half-remembered context, while a service was down and a timer was running and my judgment was operating at 40% because it was 2:30 a.m.

The fix isn't the scarce thing. The fix, made retrievable at the exact moment you need it — that's the scarce thing.

That's a solvable problem. That's engineering. What if the error string you're staring at came pre-attached to the diagnostic sequence, the decision tree, the validator that would've caught the misconfiguration three commits ago, and a runbook written by someone who was calm when they wrote it? What if the next tired person at 2 a.m. didn't have to be me at my best to get to the answer — they just had to paste the error and get a real starting point?

I started building it on weekends. It became devopsaitoolkit.com — a growing pile of the runbooks, error guides, config validators, and copy-paste prompts I'd always wished existed in a single place instead of in my head and a dozen dead wikis. I use AI in it heavily, but not the way the hype cycle wants me to: not as an oracle, just as a fast way to turn a wall of logs into a first hypothesis you then go verify yourself. Sanitize your secrets before you paste anything into a model, always. The judgment stays with you. The tool just gets you to the starting line faster. That's the whole pitch, and it's an honest one, because I built it for me.

Why I still love this

I could tell you the tidy lesson (pin your node lifecycle, drain before you terminate, understand your access modes, don't reach for the fire axe), and all of that is true and I'll dig into each of them later in this series. But that's not really why I'm writing this.

I'm writing this because at 2:47 a.m., after the graph recovered and the pages went quiet, I sat there for a minute in the specific, ridiculous satisfaction of having understood a thing. The system wasn't broken. It was protecting something — my data, from me — and once I saw it that way, the fix was obvious and even a little beautiful. Kubernetes wasn't fighting me. It was doing its job with more patience than I had.

That's the part twenty-five years hasn't worn down. The pages are annoying. The org politics are worse. The tooling will disappoint you and the on-call rotation is a tax on your Tuesdays. And underneath all of it, there is still this small, durable joy in taking a thing that made no sense at 2:14 and understanding it completely by 2:47. I get to do that for a living. I'd be lying if I called that anything but lucky.

Next time in Troubleshooting Kubernetes: the pod that was Running and completely, confidently wrong. Bring coffee.

Storage access modes are one of those things that are obvious in hindsight and expensive in the moment. The running set of Kubernetes runbooks and error guides I keep — so I'm not rebuilding them under pressure — is exactly what that site turned into. Save the pattern, not just the fix.

AI Prompt Templates for Prometheus Alerting

James Joyner — Mon, 06 Jul 2026 17:45:17 +0000

Writing good Prometheus alerts is hard. Most alerts are too sensitive (page on every blip), too lax (miss real outages), or missing context (no runbook, no labels, no severity routing). AI assistants are unusually good at the grunt work of alert authoring — if you prompt them right.

Why generic alert generators fail

Type "write me a Prometheus alert for high CPU" into any AI and you'll get:

- alert: HighCPU
  expr: cpu_usage > 80
  for: 5m

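Three things wrong already: cpu_usage isn't a metric any standard exporter emits (node_exporter exposes the counter node_cpu_seconds_total), there's no rate() window, and a bare threshold held for: 5m will still fire on every cron job that runs longer than five minutes. You need a prompt that anchors the model in production reality.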

The template structure

Our Prometheus Alert Rule Generator Prompt enforces:

Resilient PromQL — rate(), avg_over_time, or histogram_quantile() as appropriate.
Appropriate for: duration — long enough to avoid flap, short enough to detect real outages.
Severity labels and routing — severity, team, service.
Runbook annotation — every alert links to a runbook.
False-positive analysis — the model lists ways the alert could lie.
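
The mechanical parts of that checklist are easy to verify in code before a human looks at the rule. Here is a small sketch, assuming the rule has already been parsed from YAML into a dict. lint_alert is a name I made up, runbook_url is just the annotation key I'd expect by convention, and the substring check on the expression is a deliberately crude heuristic.

```python
from typing import List

REQUIRED_LABELS = ("severity", "team", "service")
# Illustrative list only; extend it to match the functions your rules use.
WINDOWED_FUNCTIONS = ("rate(", "avg_over_time(", "histogram_quantile(", "predict_linear(")


def lint_alert(rule: dict) -> List[str]:
    """Return a list of problems found in one parsed alerting rule."""
    problems = []
    if not rule.get("for"):
        problems.append("missing 'for' duration")
    labels = rule.get("labels") or {}
    for name in REQUIRED_LABELS:
        if name not in labels:
            problems.append(f"missing label '{name}'")
    annotations = rule.get("annotations") or {}
    if not annotations.get("runbook_url"):  # assumed annotation name
        problems.append("missing 'runbook_url' annotation")
    expr = rule.get("expr", "")
    if not any(fn in expr for fn in WINDOWED_FUNCTIONS):
        problems.append("expr compares a raw value with no windowed function")
    return problems


naive = {"alert": "HighCPU", "expr": "cpu_usage > 80", "for": "5m"}
for problem in lint_alert(naive):
    print(problem)
```

Run against the naive HighCPU rule from earlier, it complains about the three missing labels, the missing runbook, and the raw comparison. It cannot judge whether the for: duration is sensible or whether the alert could lie, which is why the false-positive analysis stays in the prompt.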

Three patterns worth saving

Pattern 1: Rate-based error alerts

Alert me when the 5-minute error rate exceeds 1% for at least 10 minutes, scoped per service.

Generated PromQL pattern:

expr: |
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum by (service) (rate(http_requests_total[5m]))
  > 0.01
for: 10m

Pattern 2: SLO-based latency

Alert when p99 latency exceeds my SLO threshold for 10 minutes.

expr: |
  histogram_quantile(0.99,
    sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
  ) > 0.8
for: 10m

Pattern 3: Saturation alerts

Alert when disk on any node will run out in < 4 hours based on current growth rate.

expr: |
  predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
for: 30m
labels:
  severity: warning

The predict_linear pattern is particularly nice — it pages you before the disk fills, not at 100%.
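
If the arithmetic behind predict_linear feels opaque, it is a simple linear regression over the samples in the window, extended forward in time. The sketch below reimplements that idea in plain Python to show what the expression is asking. It is not Prometheus's code, and it extrapolates from the newest sample where Prometheus measures from evaluation time. The sample data is invented.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line to (timestamp, value) pairs and extrapolate it forward."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    newest = max(t for t, _ in samples)
    return intercept + slope * (newest + seconds_ahead)


GIB = 2 ** 30
HOUR = 3600
# Invented data: free space falls by 1 GiB per hour, from 9 GiB down to 3 GiB over 6h.
window = [(h * HOUR, (9 - h) * GIB) for h in range(7)]
predicted = predict_linear(window, 4 * HOUR)
print(predicted < 0)  # the alert condition from the rule above
```

With 3 GiB left and 1 GiB disappearing every hour, the line crosses zero in about three hours, so the four-hour prediction is negative and the rule would start its 30-minute for: countdown.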

Validation: don't trust, verify

Before promoting any AI-generated alert to prod:

promtool check rules my-alerts.yml

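Run it in your staging Prometheus first. Watch it for 24 hours. Then check that it would have fired during a recent incident: promtool test rules evaluates rules against synthetic series you define in a test file, not against live history, so encode the shape of the incident as input series and assert that the alert fires.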

Combining alert generation with runbook drafting

A workflow that compounds: ask the same AI to also draft the runbook for the alert it generated. "Now write a runbook for this alert: what should the on-call check first, what are the common causes, and what's the rollback procedure?"

You'll have an alert and a runbook in 5 minutes. Both still need human review — but the blank page is gone.

The Right Way to Pair AI With Terraform Plans

James Joyner — Fri, 03 Jul 2026 18:42:46 +0000

terraform plan is honest about what it's going to do. The problem is it's also verbose, repetitive, and full of cosmetic changes (like recomputed tags) mixed in with real ones (like a database instance scheduled for -/+ replace). On a 400-line plan, the dangerous changes hide.

This is the kind of task AI is actually good at: skimming structured text, flagging the entries that matter, ignoring the rest. But "paste plan into Claude" is not the workflow. There's a specific shape to this that works.

Why people get this wrong

The natural instinct is to copy the plan output and paste it into a chat:

Terraform will perform the following actions:

  # aws_instance.web will be updated in-place
  ~ resource "aws_instance" "web" {
        id                  = "i-0abc123def456"
      ~ instance_type       = "t3.small" -> "t3.medium"
        ...

The model will respond with a sentence about each line. You'll scroll. You'll skim. You'll miss the -/+ replace on the database because it's in the middle of 30 routine updates.

This is the same failure mode as pasting a wall of logs and asking "is anything wrong?" The model is too polite to skip things. You need to tell it to.

The format that actually works: JSON

terraform show -json tfplan outputs a structured representation of the plan that's much easier to reason about than the text format. Two reasons:

The "actions" field is explicit. Each resource_change has a change.actions array: ["create"], ["delete"], ["update"], or ["delete", "create"] for replace (the order flips to ["create", "delete"] when create_before_destroy is set). No ambiguity.
You can filter before pasting. With jq, you can extract only the dangerous changes, drop the noise, and feed a 20-line summary into the AI instead of a 400-line plan.

Try this:

terraform plan -out=tfplan
terraform show -json tfplan > plan.json

# Get just the dangerous changes
jq '[.resource_changes[] |
     select(.change.actions | contains(["delete"])) |
     {address, type, actions: .change.actions}]' plan.json

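That's the AI's input. Compact, unambiguous, and pre-filtered to the changes that need a human decision. One caveat: this projection keeps only address, type, and actions, so if you want the model to explain what forced a replacement, add the relevant change.before and change.after fields (or change.replace_paths) to the jq object.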
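
If jq isn't available where this runs, the same filter is a few lines of Python. This is a sketch that mirrors the jq expression and nothing more. dangerous_changes is my own name for it, and it assumes plan.json came from terraform show -json as shown.

```python
from typing import Dict, List


def dangerous_changes(plan: dict) -> List[Dict[str, object]]:
    """Keep only resource changes whose actions include a delete (delete or replace)."""
    selected = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A membership test catches ["delete"], ["delete", "create"] and ["create", "delete"].
        if "delete" in actions:
            selected.append(
                {"address": rc["address"], "type": rc["type"], "actions": actions}
            )
    return selected
```

Load the file with json.load(open("plan.json")), pass the result in, and json.dumps whatever comes back. That output is what goes to the model.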

The prompt that catches what plans hide

Once you have the dangerous-changes JSON, the prompt is straightforward:

Here's a Terraform plan summary showing only delete and replace operations. For each resource, tell me: (1) what data is at risk (none / state but not data / data + state), (2) what triggered the replacement if applicable (look at the change.before vs change.after for the field that forces new), (3) the recommended action — proceed, snapshot first, or block.

The model now has a directed task with three clear outputs per finding. The response is scannable, actionable, and short.

For comparison, the same prompt against the raw text plan produces a wandering essay.

A real example

Here's a redacted plan summary from a recent change:

[
  {
    "address": "aws_instance.bastion",
    "type": "aws_instance",
    "actions": ["delete", "create"]
  },
  {
    "address": "aws_db_instance.main",
    "type": "aws_db_instance",
    "actions": ["delete", "create"]
  },
  {
    "address": "aws_s3_bucket.logs-old",
    "type": "aws_s3_bucket",
    "actions": ["delete"]
  }
]

Claude's review of this (paraphrased):

aws_instance.bastion — Replace is fine, no persistent data on bastions. Brief outage of jump-host access (~2 min). Proceed.
aws_db_instance.main — DANGEROUS. Database replace = data loss unless skip_final_snapshot = false and you've verified the snapshot will be created. Check the plan JSON for skip_final_snapshot in the change.after — if true, BLOCK. If false, the snapshot will save data but restore is a manual operation. Recommend creating a manual snapshot first regardless.
aws_s3_bucket.logs-old — Delete. If the bucket has objects, this fails by default. If force_destroy = true, all objects are deleted with the bucket. Check the bucket isn't actively used.

This is exactly the kind of review I want before applying. Without the AI, I'd probably catch the DB replace, but I might miss the force_destroy nuance on the S3 bucket because I'd be in a hurry.
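
The two flags the review called out can also be checked mechanically before the model ever sees the plan. Here is a rough sketch, assuming full resource_changes entries from the plan JSON (not the trimmed summary). risk_hints is a made-up name, and because I'm not certain which side of the change the provider consults at destroy time, it looks at both change.before and change.after and flags the resource if either has the flag set.

```python
from typing import List


def _flag_is_set(change: dict, attribute: str) -> bool:
    """True if the attribute is true in either the before or the after state."""
    before = change.get("before") or {}  # null for pure creates
    after = change.get("after") or {}    # null for pure deletes
    return before.get(attribute) is True or after.get(attribute) is True


def risk_hints(resource_change: dict) -> List[str]:
    """Flag delete/replace operations where a known footgun attribute is set."""
    change = resource_change.get("change", {})
    if "delete" not in change.get("actions", []):
        return []
    hints = []
    rtype = resource_change.get("type")
    if rtype == "aws_db_instance" and _flag_is_set(change, "skip_final_snapshot"):
        hints.append("skip_final_snapshot is true: no final snapshot on destroy")
    if rtype == "aws_s3_bucket" and _flag_is_set(change, "force_destroy"):
        hints.append("force_destroy is true: objects are deleted with the bucket")
    return hints
```

Hints like these are worth appending to the prompt input, so the model starts from the facts and doesn't have to tell you to go look them up.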

What about Sentinel / OPA / Checkov?

These tools enforce policies — "no public S3 buckets," "no RDS without deletion protection." They're floor-setting. They don't help with the per-change judgment calls: "is this specific replace acceptable for this specific resource right now?"

I use both. Checkov in CI catches the consistent rule violations. AI review of the plan catches the contextual ones — the cases where a replace is technically allowed but operationally risky.

CI integration

Once this workflow proves out, you can automate it. A simple GitLab CI job:

plan-review:
  stage: review
  needs: [plan]
  image: alpine:3.20
  script:
    - apk add --no-cache jq curl
    - DANGEROUS=$(jq '[.resource_changes[] |
                       select(.change.actions | contains(["delete"]))]' plan.json)
    - |
      if [ "$(echo "$DANGEROUS" | jq length)" -gt 0 ]; then
        echo "Dangerous changes detected, requesting AI review..."
        # Call Claude API with $DANGEROUS as input
        # Post result as MR comment
      fi

A dozen lines of YAML isn't production-ready (among other things, the plan job has to publish plan.json as an artifact for this job to read), but the pattern is: detect dangerous changes, send them to AI for contextual review, post the result where the human reviewer will see it.

The point is to make the AI review part of the workflow, not a thing you remember to do. By the time you're tired enough to miss a replace in a 400-line plan, you also won't remember to ask AI about it.

What AI can't tell you about a plan

It can't tell you whether the change is intended. A replace of a database might be deliberate (you're migrating engines), in which case the snapshot-first advice is annoying overhead. The AI sees structure; you see intent. The two together are the workflow.

It also can't tell you whether the change is complete. Sometimes a plan looks safe in isolation but breaks something downstream because of a dependency you forgot about. The AI doesn't know your downstream dependencies. You do.

The reviewer is still you. The AI is just a fast filter on the parts of the plan that need attention.

For the full prompt set on Terraform safety, see the Terraform category — including terraform-plan-review-checklist and terraform-dangerous-changes-review.

DevOps as a Service Pricing: What Should Businesses Expect to Pay?

James Joyner — Mon, 29 Jun 2026 22:02:07 +0000

After 25 years of keeping production systems alive — building the automation, owning the pager, and helping companies stop bleeding money on preventable outages — the question I get asked most by founders and operations leads is blunt: "What is this going to cost me?"

The honest answer is the one nobody likes: it depends. But "it depends" isn't useful if you're trying to budget. So let me give you the real version — what drives the number, the pricing models you'll actually be quoted, and a simple way to figure out whether the spend pays for itself.

Why DevOps pricing varies so much

There's no sticker price on DevOps for the same reason there's no sticker price on "fixing my house." A one-bedroom condo and a 40-year-old farmhouse are different jobs. Three things move the number more than anything else:

Company size. A two-person startup with one Linux server and a single web app is a fundamentally different engagement than a 200-person company running multiple Kubernetes clusters across regions.
Infrastructure complexity. A static site on a single cloud VM is cheap to run. A microservices platform with service meshes, multiple databases, message queues, and compliance requirements is not.
Support expectations. "Help us when something breaks during business hours" and "24/7 on-call with a 15-minute response SLA" are priced an order of magnitude apart, because one of them owns someone's nights and weekends.

Before you can compare quotes, you have to be honest about which of those buckets you're actually in. A provider quoting you a low number may simply be assuming a smaller scope than the one you need.

The common pricing models

Most DevOps as a Service work is sold under one of five models. Each fits a different situation, and good providers will steer you toward the right one rather than forcing everything into their favorite.

Hourly / time-and-materials

You pay for hours worked, usually billed against a monthly cap.

When it fits: Small, well-defined tasks, ad-hoc help, or an early relationship where neither side knows the full scope yet.
Rough ballpark: Rates vary widely by region and seniority. The trap is that hourly incentivizes activity, not outcomes — a cheap hourly rate from someone who takes three times as long is not a bargain.

Monthly retainer

A fixed monthly fee buys you a block of capacity and ongoing ownership of your infrastructure.

When it fits: You have living infrastructure that needs continuous care — patching, monitoring, upgrades, small improvements — and you want a predictable line item.
Example: Ongoing Kubernetes version upgrades, Prometheus and Grafana tuning, and routine Ansible-driven patching of your Linux fleet are classic retainer work. The cluster doesn't stop needing attention, so neither does the engagement.

Project-based / fixed bid

A scoped deliverable for a fixed price.

When it fits: A clear, bounded build with a defined "done."
Example: A one-time Terraform plus GitLab CI/CD build-out — provision the cloud accounts, write the infrastructure as code, stand up the pipelines, Dockerize the apps, and hand it over — is naturally project-priced. You know what you're getting and what it costs before work starts.

Emergency / incident support

On-demand help when production is on fire, often at a premium rate or via a pre-paid response retainer.

When it fits: You run your own systems day-to-day but want a number to call when something serious breaks.
Reality check: This is the most expensive way to buy help per hour, because you're paying for someone to drop everything. It's insurance, not a maintenance plan — and it's far cheaper to prevent the incident than to buy emergency labor mid-outage.

Fully managed service

The provider owns your DevOps function end to end — infrastructure, pipelines, monitoring, security, on-call, the lot.

When it fits: You don't want to hire and retain an internal platform team, or you want to extend the small one you have.
Reality check: This is the highest monthly spend, but compare it against the loaded cost of hiring senior engineers, the recruiting time, and the bus-factor risk of a one-person internal team. Often it's cheaper and far less fragile than building the same capability in-house.

A healthy engagement often mixes models: a project-priced initial build-out, then a monthly retainer to run what was built.

What services actually move the price

Within any model, the scope of work is what sets the number. The big cost factors:

Cloud setup and infrastructure as code. Account structure, networking, and Terraform modules to make it all reproducible.
CI/CD pipelines. Building and maintaining GitLab CI/CD (or equivalent) so deploys are fast, repeatable, and safe.
Containers and orchestration. Docker images, registries, and Kubernetes — the single biggest complexity multiplier in modern infrastructure.
Monitoring and observability. Prometheus, Grafana, alerting rules, and dashboards. Good monitoring and alerting are what turn a 3 a.m. outage into a 9 a.m. ticket.
Security. Secrets management, access control, network policy, vulnerability scanning, and hardening of your Linux servers.
Backups and disaster recovery. Tested restores — not just backups that exist on paper.
Incident response and on-call. The cost of someone being awake and accountable when things go wrong.
Automation. Ansible playbooks and scripting that replace manual, error-prone toil.
Compliance. SOC 2, HIPAA, PCI, and friends add audit, documentation, and control work that materially raises cost.

The more of these you need, and the higher the stakes, the higher the price. That's not padding — it's the actual work of keeping a real system running.

Why cheaper is not always better

Here's where my experience makes me opinionated: in production infrastructure, the cheapest quote is frequently the most expensive decision.

A low bid usually means one of a few things — a junior engineer learning on your dime, a scope that quietly excludes monitoring or backups, or a contractor who'll bolt something together and disappear before the technical debt comes due. You don't find out until the pipeline breaks at the worst possible moment, the backups turn out to be untested, or a security gap becomes an incident.

Infrastructure is one of those areas where you're not buying hours — you're buying the absence of disasters. That's hard to see on an invoice and very easy to feel in an outage.

What downtime actually costs

This is the framing that changes the conversation. Put a number on downtime and the "expensive" DevOps quote suddenly looks like a rounding error.

A simple cost-of-downtime model:

Downtime cost per hour = (Annual revenue / Business hours per year) + recovery labor per hour + reputation/churn cost per hour

Work a concrete example. Say a business does $5,000,000 in revenue a year and runs roughly 3,000 business hours. That's about $1,667 per hour in direct lost revenue — before you add the engineers pulled off roadmap work to firefight, the customers who churn, and the support load from a public incident. Call it $2,500–$4,000 an hour, conservatively.
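
The model above can be sketched in a few lines of Python. The revenue and business-hours figures come from the example. The recovery-labor and churn amounts are assumptions picked only to show how the direct figure grows into the quoted range, so substitute your own.

```python
def downtime_cost_per_hour(
    annual_revenue: float,
    business_hours_per_year: float,
    recovery_labor_per_hour: float = 0.0,
    churn_cost_per_hour: float = 0.0,
) -> float:
    """Direct lost revenue per hour plus the indirect per-hour costs."""
    direct = annual_revenue / business_hours_per_year
    return direct + recovery_labor_per_hour + churn_cost_per_hour


# Figures from the worked example: direct lost revenue only.
direct_only = downtime_cost_per_hour(5_000_000, 3_000)

# Assumed indirect costs (hypothetical, not from any measurement).
loaded = downtime_cost_per_hour(
    5_000_000,
    3_000,
    recovery_labor_per_hour=500.0,
    churn_cost_per_hour=800.0,
)

print(f"direct only: ${direct_only:,.0f}/hour")
print(f"with indirect costs: ${loaded:,.0f}/hour")
```

Treating labor and churn as per-hour amounts keeps every term in the same unit, which is one reasonable reading of the formula rather than the only one.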

Now consider what causes that downtime in shops without proper DevOps:

Failed deployments with no pipeline safeguards or rollback — a bad release that takes hours to unwind.
Poor monitoring that means you learn about the outage from angry customers instead of an alert, adding 30+ minutes of pure detection delay to every incident.
Manual, undocumented processes where only one person knows how to restore the service, and they're on vacation.

A single multi-hour outage can cost more than a year of competent monitoring and incident-response coverage. The DevOps spend isn't competing with zero — it's competing with the outages it prevents.

How AI changes the math

Part of why DevOps value-for-money has improved is that AI now removes a large slice of the repetitive labor that used to fill the bill.

Drafting and reviewing infrastructure as code. Terraform and Ansible scaffolding that used to take hours gets drafted in minutes, then reviewed by a human.
Pipeline and config generation. GitLab CI/CD configs, Dockerfiles, and Kubernetes manifests start from a solid AI-generated baseline instead of a blank file.
Monitoring setup. Generating sensible Prometheus alert rules and Grafana panels — historically tedious, easily templated work — is far faster with AI assistance.
Incident triage. Summarizing logs and correlating "what changed" compresses the slow part of an outage.

The key word is assisted — a human still owns every change to production. But a provider using AI well can deliver more per dollar, which means you get broader coverage for the same budget. If you want to see the kind of work this accelerates, our prompt library shows the patterns we lean on.

Starting lean: startups and small businesses

If you're early-stage, you do not need a fully managed enterprise engagement, and you shouldn't pay for one. Start with a lean package that covers the essentials and nothing you won't use yet:

A reproducible cloud setup with Terraform, so you're never clicking around a console by hand.
One clean CI/CD pipeline so deploys are boring and repeatable.
Basic monitoring and alerting on the handful of metrics that actually predict outages.
Tested backups.
A documented runbook so recovery doesn't depend on one person's memory.

That's a modest retainer or a small fixed-bid build-out, and it removes the failure modes that sink small companies. You add Kubernetes, deeper observability, and compliance work later — when the business actually needs them, not before. You can see how we structure tiers like this on our pricing page.

How to calculate ROI

Don't buy DevOps on vibes. Run the numbers. A usable formula:

ROI (%) = ((Value gained - Cost of service) / Cost of service) x 100

Where value gained is the sum of:

Downtime avoided — fewer outage hours × your cost-of-downtime-per-hour.
Engineering time reclaimed — hours your developers stop spending on infrastructure toil, at their loaded cost.
Faster delivery — features shipped sooner because the pipeline is fast and reliable.
Incidents prevented — the emergency-rate firefighting you never have to buy.

A worked example. Suppose a managed engagement costs $60,000 a year. Over that year it:

Prevents an estimated 20 hours of downtime at $3,000/hour = $60,000.
Frees two developers from ~5 hours/week of infra work — roughly $50,000 of reclaimed engineering time.
Speeds delivery enough to pull in revenue you'd otherwise have deferred — call it $30,000, conservatively.

That's $140,000 of value against $60,000 of cost:

ROI = (($140,000 - $60,000) / $60,000) x 100 = ~133%

Even if you halve every one of those estimates to be safe, you still come out ahead: $70,000 of value against $60,000 of cost is an ROI of about 17%. The exercise matters more than the exact figures. When you actually price the downtime you avoid and the time you reclaim, good DevOps consistently pays for itself.
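
The formula and worked example above can be checked in a few lines, including the halve-everything sanity test. This is a rough sketch using the article's illustrative figures. The dictionary keys are labels made up for this example.

```python
def roi_percent(value_gained: float, cost: float) -> float:
    """ROI (%) = ((value gained - cost of service) / cost of service) x 100."""
    return (value_gained - cost) / cost * 100


# Illustrative figures from the worked example.
benefits = {
    "downtime_avoided": 20 * 3_000,        # 20 hours at $3,000/hour
    "engineering_time_reclaimed": 50_000,
    "faster_delivery": 30_000,
}
cost_of_service = 60_000

total_value = sum(benefits.values())
base_roi = roi_percent(total_value, cost_of_service)

# Sensitivity check: every benefit estimate cut in half.
halved_roi = roi_percent(total_value / 2, cost_of_service)

print(f"total value: ${total_value:,}")
print(f"base ROI: {base_roi:.0f}%")
print(f"ROI with halved estimates: {halved_roi:.1f}%")
```

Running the halved case alongside the base case shows how much margin the estimate has before the engagement stops paying for itself.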

The bottom line

DevOps as a Service pricing genuinely varies, and any provider who hands you a flat number without understanding your systems is guessing. But the framework is straightforward: know which size and complexity bucket you're in, pick the pricing model that fits the work, scope the services you actually need, and run the ROI math against the very real cost of doing nothing.

The mistake I see most often is treating DevOps as a cost line to minimize. It isn't. It's an investment in uptime, delivery speed, security, and the ability to scale without setting your infrastructure on fire. Price it against the outages, the lost engineering hours, and the deals you can't close because the platform won't hold — and the question stops being "what does this cost?" and becomes "what is it costing me not to have it?"

Cost figures and ranges here are illustrative. Build your own estimate from your real revenue, infrastructure, and risk profile before committing to a budget.

This article was originally published on DevOps AI ToolKit — practical AI workflows for cloud engineers.
