<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James Joyner</title>
    <description>The latest articles on DEV Community by James Joyner (@jjoyneriv).</description>
    <link>https://dev.to/jjoyneriv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3973259%2Fcec8573d-242f-4c1d-a0a5-d1120d871e27.png</url>
      <title>DEV Community: James Joyner</title>
      <link>https://dev.to/jjoyneriv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jjoyneriv"/>
    <language>en</language>
    <item>
      <title>DevOps as a Service Pricing: What Should Businesses Expect to Pay?</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Mon, 29 Jun 2026 22:02:07 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/devops-as-a-service-pricing-what-should-businesses-expect-to-pay-2481</link>
      <guid>https://dev.to/devopsaitoolkit/devops-as-a-service-pricing-what-should-businesses-expect-to-pay-2481</guid>
      <description>&lt;p&gt;After 25 years of keeping production systems alive — building the automation, owning the pager, and helping companies stop bleeding money on preventable outages — the question I get asked most by founders and operations leads is blunt: &lt;em&gt;"What is this going to cost me?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The honest answer is the one nobody likes: it depends. But "it depends" isn't useful if you're trying to budget. So let me give you the real version — what drives the number, the pricing models you'll actually be quoted, and a simple way to figure out whether the spend pays for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DevOps pricing varies so much
&lt;/h2&gt;

&lt;p&gt;There's no sticker price on DevOps for the same reason there's no sticker price on "fixing my house." A one-bedroom condo and a 40-year-old farmhouse are different jobs. Three things move the number more than anything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Company size.&lt;/strong&gt; A two-person startup with one Linux server and a single web app is a fundamentally different engagement than a 200-person company running multiple Kubernetes clusters across regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure complexity.&lt;/strong&gt; A static site on a single cloud VM is cheap to run. A microservices platform with service meshes, multiple databases, message queues, and compliance requirements is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support expectations.&lt;/strong&gt; "Help us when something breaks during business hours" and "24/7 on-call with a 15-minute response SLA" are priced an order of magnitude apart, because one of them owns someone's nights and weekends.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before you can compare quotes, you have to be honest about which of those buckets you're actually in. A provider quoting you a low number may simply be assuming a smaller scope than the one you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The common pricing models
&lt;/h2&gt;

&lt;p&gt;Most DevOps as a Service work is sold under one of five models. Each fits a different situation, and good providers will steer you toward the right one rather than forcing everything into their favorite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hourly / time-and-materials
&lt;/h3&gt;

&lt;p&gt;You pay for hours worked, usually billed against a monthly cap.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; Small, well-defined tasks, ad-hoc help, or an early relationship where neither side knows the full scope yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality check:&lt;/strong&gt; Rates vary widely by region and seniority, so there is no single ballpark worth quoting. The trap is that hourly billing incentivizes activity, not outcomes: a cheap hourly rate from someone who takes three times as long is not a bargain.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monthly retainer
&lt;/h3&gt;

&lt;p&gt;A fixed monthly fee buys you a block of capacity and ongoing ownership of your infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; You have living infrastructure that needs continuous care — patching, monitoring, upgrades, small improvements — and you want a predictable line item.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Ongoing Kubernetes version upgrades, Prometheus and Grafana tuning, and routine Ansible-driven patching of your Linux fleet are classic retainer work. The cluster doesn't stop needing attention, so neither does the engagement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project-based / fixed bid
&lt;/h3&gt;

&lt;p&gt;A scoped deliverable for a fixed price.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; A clear, bounded build with a defined "done."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A one-time Terraform plus GitLab CI/CD build-out — provision the cloud accounts, write the infrastructure as code, stand up the pipelines, Dockerize the apps, and hand it over — is naturally project-priced. You know what you're getting and what it costs before work starts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Emergency / incident support
&lt;/h3&gt;

&lt;p&gt;On-demand help when production is on fire, often at a premium rate or via a pre-paid response retainer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; You run your own systems day-to-day but want a number to call when something serious breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality check:&lt;/strong&gt; This is the most expensive way to buy help per hour, because you're paying for someone to drop everything. It's insurance, not a maintenance plan — and it's far cheaper to prevent the incident than to buy emergency labor mid-outage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fully managed service
&lt;/h3&gt;

&lt;p&gt;The provider owns your DevOps function end to end — infrastructure, pipelines, monitoring, security, on-call, the lot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; You don't want to hire and retain an internal platform team, or you want to extend the small one you have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality check:&lt;/strong&gt; This is the highest monthly spend, but compare it against the loaded cost of hiring senior engineers, the recruiting time, and the bus-factor risk of a one-person internal team. Often it's cheaper and far less fragile than building the same capability in-house.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A healthy engagement often mixes models: a project-priced initial build-out, then a monthly retainer to run what was built.&lt;/p&gt;

&lt;h2&gt;
  
  
  What services actually move the price
&lt;/h2&gt;

&lt;p&gt;Within any model, the scope of work is what sets the number. The big cost factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud setup and infrastructure as code.&lt;/strong&gt; Account structure, networking, and Terraform modules to make it all reproducible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipelines.&lt;/strong&gt; Building and maintaining GitLab CI/CD (or equivalent) so deploys are fast, repeatable, and safe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containers and orchestration.&lt;/strong&gt; Docker images, registries, and Kubernetes — the single biggest complexity multiplier in modern infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and observability.&lt;/strong&gt; Prometheus, Grafana, alerting rules, and dashboards. Good &lt;a href="https://dev.to/dashboard/monitoring-alerts/"&gt;monitoring and alert generation&lt;/a&gt; is what turns a 3am outage into a 9am ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security.&lt;/strong&gt; Secrets management, access control, network policy, vulnerability scanning, and hardening of your Linux servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups and disaster recovery.&lt;/strong&gt; Tested restores — not just backups that exist on paper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response and on-call.&lt;/strong&gt; The cost of someone being awake and accountable when things go wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation.&lt;/strong&gt; Ansible playbooks and scripting that replace manual, error-prone toil.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance.&lt;/strong&gt; SOC 2, HIPAA, PCI, and friends add audit, documentation, and control work that materially raises cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more of these you need, and the higher the stakes, the higher the price. That's not padding — it's the actual work of keeping a real system running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cheaper is not always better
&lt;/h2&gt;

&lt;p&gt;Here's where my experience makes me opinionated: in production infrastructure, the cheapest quote is frequently the most expensive decision.&lt;/p&gt;

&lt;p&gt;A low bid usually means one of a few things — a junior engineer learning on your dime, a scope that quietly excludes monitoring or backups, or a contractor who'll bolt something together and disappear before the technical debt comes due. You don't find out until the pipeline breaks at the worst possible moment, the backups turn out to be untested, or a security gap becomes an incident.&lt;/p&gt;

&lt;p&gt;Infrastructure is one of those areas where you're not buying hours — you're buying the absence of disasters. That's hard to see on an invoice and very easy to feel in an outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What downtime actually costs
&lt;/h2&gt;

&lt;p&gt;This is the framing that changes the conversation. Put a number on downtime and the "expensive" DevOps quote suddenly looks like a rounding error.&lt;/p&gt;

&lt;p&gt;A simple cost-of-downtime model:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Downtime cost per hour = (Annual revenue / Business hours per year) + recovery labor per hour + reputation/churn cost per hour
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Work a concrete example. Say a business does &lt;strong&gt;$5,000,000&lt;/strong&gt; in revenue a year and runs roughly &lt;strong&gt;3,000 business hours&lt;/strong&gt;. That's about &lt;strong&gt;$1,667 per hour&lt;/strong&gt; in direct lost revenue — before you add the engineers pulled off roadmap work to firefight, the customers who churn, and the support load from a public incident. Call it &lt;strong&gt;$2,500–$4,000 an hour&lt;/strong&gt;, conservatively.&lt;/p&gt;
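&lt;p&gt;The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the function name and the $1,000 and $500 hourly overheads are assumptions, picked only so the loaded figure lands inside the range quoted above.&lt;/p&gt;

```python
def downtime_cost_per_hour(annual_revenue, business_hours_per_year,
                           recovery_labor_per_hour=0.0,
                           churn_cost_per_hour=0.0):
    """Hourly cost of downtime: lost revenue plus per-hour overheads."""
    lost_revenue = annual_revenue / business_hours_per_year
    return lost_revenue + recovery_labor_per_hour + churn_cost_per_hour


# Figures from the example: $5,000,000 a year over 3,000 business hours.
direct = downtime_cost_per_hour(5_000_000, 3_000)
print(f"Direct lost revenue: ${direct:,.0f}/hour")

# Hypothetical overheads (assumed for illustration, not measured).
loaded = downtime_cost_per_hour(5_000_000, 3_000,
                                recovery_labor_per_hour=1_000,
                                churn_cost_per_hour=500)
print(f"Loaded estimate: ${loaded:,.0f}/hour")
```

&lt;p&gt;Swap in your own revenue, operating hours, and overhead estimates; the structure matters more than these particular numbers.&lt;/p&gt;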

&lt;p&gt;Now consider what causes that downtime in shops without proper DevOps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failed deployments&lt;/strong&gt; with no pipeline safeguards or rollback — a bad release that takes hours to unwind.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor monitoring&lt;/strong&gt; that means you learn about the outage from angry customers instead of an alert, which can easily add 30 minutes or more of pure detection delay to an incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual, undocumented processes&lt;/strong&gt; where only one person knows how to restore the service, and they're on vacation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single multi-hour outage can cost more than a year of competent monitoring and incident-response coverage. The DevOps spend isn't competing with zero — it's competing with the outages it prevents.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI changes the math
&lt;/h2&gt;

&lt;p&gt;Part of why DevOps value-for-money has improved is that AI now removes a large slice of the repetitive labor that used to fill the bill.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drafting and reviewing infrastructure as code.&lt;/strong&gt; Terraform and Ansible scaffolding that used to take hours gets drafted in minutes, then reviewed by a human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline and config generation.&lt;/strong&gt; GitLab CI/CD configs, Dockerfiles, and Kubernetes manifests start from a solid AI-generated baseline instead of a blank file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring setup.&lt;/strong&gt; Generating sensible Prometheus alert rules and Grafana panels — historically tedious, easily templated work — is far faster with AI assistance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident triage.&lt;/strong&gt; Summarizing logs and correlating "what changed" compresses the slow part of an outage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key word is &lt;em&gt;assisted&lt;/em&gt; — a human still owns every change to production. But a provider using AI well can deliver more per dollar, which means you get broader coverage for the same budget. If you want to see the kind of work this accelerates, our &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; shows the patterns we lean on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting lean: startups and small businesses
&lt;/h2&gt;

&lt;p&gt;If you're early-stage, you do not need a fully managed enterprise engagement, and you shouldn't pay for one. Start with a lean package that covers the essentials and nothing you won't use yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reproducible cloud setup with Terraform, so you're never clicking around a console by hand.&lt;/li&gt;
&lt;li&gt;One clean CI/CD pipeline so deploys are boring and repeatable.&lt;/li&gt;
&lt;li&gt;Basic monitoring and alerting on the handful of metrics that actually predict outages.&lt;/li&gt;
&lt;li&gt;Tested backups.&lt;/li&gt;
&lt;li&gt;A documented runbook so recovery doesn't depend on one person's memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a modest retainer or a small fixed-bid build-out, and it removes the failure modes that sink small companies. You add Kubernetes, deeper observability, and compliance work later — when the business actually needs them, not before. You can see how we structure tiers like this on our &lt;a href="https://dev.to/pricing"&gt;pricing&lt;/a&gt; page.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to calculate ROI
&lt;/h2&gt;

&lt;p&gt;Don't buy DevOps on vibes. Run the numbers. A usable formula:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROI (%) = ((Value gained - Cost of service) / Cost of service) × 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;strong&gt;value gained&lt;/strong&gt; is the sum of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Downtime avoided&lt;/strong&gt; — fewer outage hours × your cost-of-downtime-per-hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering time reclaimed&lt;/strong&gt; — hours your developers stop spending on infrastructure toil, at their loaded cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster delivery&lt;/strong&gt; — features shipped sooner because the pipeline is fast and reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incidents prevented&lt;/strong&gt; — the emergency-rate firefighting you never have to buy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A worked example. Suppose a managed engagement costs &lt;strong&gt;$60,000 a year&lt;/strong&gt;. Over that year it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents an estimated &lt;strong&gt;20 hours&lt;/strong&gt; of downtime at &lt;strong&gt;$3,000/hour&lt;/strong&gt; = &lt;strong&gt;$60,000&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Frees &lt;strong&gt;two developers&lt;/strong&gt; from ~5 hours/week of infra work — roughly &lt;strong&gt;$50,000&lt;/strong&gt; of reclaimed engineering time.&lt;/li&gt;
&lt;li&gt;Speeds delivery enough to pull in revenue you'd otherwise have deferred — call it &lt;strong&gt;$30,000&lt;/strong&gt;, conservatively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;strong&gt;$140,000&lt;/strong&gt; of value against &lt;strong&gt;$60,000&lt;/strong&gt; of cost:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROI = (($140,000 - $60,000) / $60,000) × 100 ≈ 133%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if you halve every one of those estimates to be safe, you still come out ahead: $70,000 of value against $60,000 of cost is roughly a 17% return. The exercise matters more than the exact figures. When you actually price the downtime you avoid and the time you reclaim, good DevOps consistently pays for itself.&lt;/p&gt;
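&lt;p&gt;For anyone who wants to rerun the worked example with their own inputs, here is a small Python sketch of the same formula, including the pessimistic case where every estimate is cut in half. The function and variable names are illustrative assumptions; the dollar figures are the ones used above.&lt;/p&gt;

```python
def roi_percent(value_gained, cost_of_service):
    """ROI (%) = ((value gained - cost of service) / cost of service) * 100."""
    return (value_gained - cost_of_service) / cost_of_service * 100


COST = 60_000  # annual cost of the managed engagement in the example

# Value estimates from the worked example, in dollars per year.
estimates = {
    "downtime_avoided": 20 * 3_000,  # 20 hours at $3,000/hour
    "engineering_time": 50_000,
    "faster_delivery": 30_000,
}

full_value = sum(estimates.values())
halved_value = full_value / 2  # pessimistic case: every estimate halved

print(f"Full estimates:   {roi_percent(full_value, COST):.0f}% ROI")
print(f"Halved estimates: {roi_percent(halved_value, COST):.0f}% ROI")
```

&lt;p&gt;If the halved case goes negative with your numbers, that is a signal to tighten the scope or revisit the estimates before signing.&lt;/p&gt;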

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;DevOps as a Service pricing genuinely varies, and any provider who hands you a flat number without understanding your systems is guessing. But the framework is straightforward: know which size and complexity bucket you're in, pick the pricing model that fits the work, scope the services you actually need, and run the ROI math against the very real cost of doing nothing.&lt;/p&gt;

&lt;p&gt;The mistake I see most often is treating DevOps as a cost line to minimize. It isn't. It's an investment in uptime, delivery speed, security, and the ability to scale without setting your infrastructure on fire. Price it against the outages, the lost engineering hours, and the deals you can't close because the platform won't hold — and the question stops being "what does this cost?" and becomes "what is it costing me not to have it?"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost figures and ranges here are illustrative. Build your own estimate from your real revenue, infrastructure, and risk profile before committing to a budget.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/devops-as-a-service-pricing-what-to-expect/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>pricing</category>
      <category>manageddevops</category>
      <category>roi</category>
    </item>
    <item>
      <title>Best DevSecOps Security Tools for CI/CD Pipeline Protection</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sun, 28 Jun 2026 05:05:55 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/best-devsecops-security-tools-for-cicd-pipeline-protection-25ff</link>
      <guid>https://dev.to/devopsaitoolkit/best-devsecops-security-tools-for-cicd-pipeline-protection-25ff</guid>
      <description>&lt;p&gt;I've spent twenty-five years building and securing deployment pipelines, and the single biggest shift in that time isn't a tool — it's &lt;em&gt;where&lt;/em&gt; security lives. We used to bolt it on at the end, right before a release, when changing anything was expensive and everyone was already tired. That's backwards. DevSecOps is the correction: you move security checks left, into the pipeline, so problems surface when they're cheap to fix and the person who introduced them is still looking at the code.&lt;/p&gt;

&lt;p&gt;This is a tour of the tool &lt;em&gt;categories&lt;/em&gt; that matter, with representative (mostly open-source) examples and where each one fits in a real GitLab CI/CD or GitHub Actions pipeline. It is not a ranked ad. The right toolchain depends on your team size and how mature your infrastructure is, and I'll come back to that at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevSecOps actually means
&lt;/h2&gt;

&lt;p&gt;DevSecOps is "shift-left security": treating security as a property of the pipeline, not a gate at the end of it. Concretely, it means your CI runs the same checks a security reviewer would — scanning code, dependencies, containers, infrastructure definitions, and secrets — &lt;em&gt;automatically, on every push&lt;/em&gt;, and fails the build when it finds something that genuinely matters.&lt;/p&gt;

&lt;p&gt;The goal isn't to drown developers in findings. It's to catch the dangerous classes of mistakes early and consistently, so security becomes muscle memory instead of a quarterly fire drill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the CI/CD pipeline is a prime target
&lt;/h2&gt;

&lt;p&gt;Your pipeline is the most privileged thing in your engineering org and the least watched. It has credentials to your cloud, your registry, and production. That makes it a high-value target in ways teams routinely underestimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Supply-chain attacks.&lt;/strong&gt; A compromised dependency or a malicious GitHub Action runs &lt;em&gt;inside&lt;/em&gt; your build with your secrets in the environment. SolarWinds and the Codecov breach were both pipeline-level compromises, not application bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets sprawl.&lt;/strong&gt; Pipelines are where API keys, cloud credentials, and registry tokens live. A leaked CI variable or a key hardcoded in a &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; is a direct path to your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner trust.&lt;/strong&gt; Shared or self-hosted runners that build untrusted code can be poisoned. A pull request from a fork that triggers a privileged job is a classic foothold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact tampering.&lt;/strong&gt; If nothing verifies that the image you deploy is the image you built, an attacker who can write to your registry can swap it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every category below is a defense against one or more of these.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAST — Static Application Security Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans your source code without running it, looking for vulnerable patterns such as SQL injection, command injection, unsafe deserialization, and hardcoded keys or other cryptographic mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; early, on every merge request, as a fast job that runs before anything gets built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://semgrep.dev" rel="noopener noreferrer"&gt;Semgrep&lt;/a&gt; is my default — it's open-source, fast, and its rules read like the code they match, so writing a custom rule for your own footguns takes minutes. GitLab ships a built-in &lt;a href="https://docs.gitlab.com/ee/user/application_security/sast/" rel="noopener noreferrer"&gt;SAST&lt;/a&gt; analyzer you can enable with a single &lt;code&gt;include&lt;/code&gt; in your &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;. For Python-specific work, Bandit is a lightweight option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; run SAST in &lt;em&gt;diff&lt;/em&gt; mode on merge requests so it only flags new findings. Nothing kills adoption faster than a first run that reports 4,000 pre-existing issues and blocks everyone. Baseline the legacy debt, enforce on new code.&lt;/p&gt;
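&lt;p&gt;The "baseline the legacy debt, enforce on new code" idea can be sketched as a simple set difference. This is an illustration of the concept only: the finding fields (&lt;code&gt;rule_id&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;snippet&lt;/code&gt;, &lt;code&gt;line&lt;/code&gt;) are a hypothetical schema, not the output format of Semgrep or any other scanner, and real tools handle baselining for you.&lt;/p&gt;

```python
def fingerprint(finding):
    """Stable identity for a finding; line numbers are left out on purpose."""
    return (finding["rule_id"], finding["path"], finding["snippet"])


def new_findings(current, baseline):
    """Return only the findings that are absent from the accepted baseline."""
    known = {fingerprint(f) for f in baseline}
    return [f for f in current if fingerprint(f) not in known]


# Hypothetical scanner output: one legacy issue and one introduced by the MR.
baseline = [
    {"rule_id": "sql-injection", "path": "legacy/report.py",
     "snippet": "cursor.execute(query % user_id)", "line": 88},
]
current = [
    # Same legacy issue; code was added above it, so the line number moved.
    {"rule_id": "sql-injection", "path": "legacy/report.py",
     "snippet": "cursor.execute(query % user_id)", "line": 93},
    {"rule_id": "command-injection", "path": "api/export.py",
     "snippet": "os.system(cmd)", "line": 12},
]

blocking = new_findings(current, baseline)
for finding in blocking:
    print(f"NEW {finding['rule_id']} in {finding['path']}:{finding['line']}")
```

&lt;p&gt;Keying on rule, path, and snippet rather than line number is one possible design choice: it keeps a legacy finding from reappearing as "new" every time unrelated code shifts it down the file.&lt;/p&gt;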

&lt;h2&gt;
  
  
  DAST — Dynamic Application Security Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; attacks a &lt;em&gt;running&lt;/em&gt; instance of your app the way an external scanner would — probing for XSS, injection, misconfigured headers, and exposed endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; later in the pipeline, after you've deployed the build to an ephemeral staging environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://www.zaproxy.org" rel="noopener noreferrer"&gt;OWASP ZAP&lt;/a&gt; is the open-source standard. Its baseline scan runs cleanly as a Docker container in a CI job — point it at your staging URL and it produces a report you can fail the pipeline on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; DAST is slower and noisier than SAST, so don't gate every merge request on a full active scan. Run a quick ZAP baseline scan on merge requests and a deeper scan nightly against staging.&lt;/p&gt;

&lt;h2&gt;
  
  
  SCA / dependency scanning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Software Composition Analysis inventories your third-party dependencies and flags ones with known CVEs. Most of the code in your app isn't yours, and this is where most exploitable vulnerabilities actually live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; on every build, and continuously in the background as new CVEs are published against dependencies you already shipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://trivy.dev" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt; and &lt;a href="https://github.com/anchore/grype" rel="noopener noreferrer"&gt;Grype&lt;/a&gt; both scan lockfiles and dependency manifests for known vulnerabilities, and both are open-source and fast. For &lt;em&gt;fixing&lt;/em&gt;, not just finding, &lt;a href="https://docs.github.com/en/code-security/dependabot" rel="noopener noreferrer"&gt;Dependabot&lt;/a&gt; (GitHub-native) and &lt;a href="https://docs.renovatebot.com" rel="noopener noreferrer"&gt;Renovate&lt;/a&gt; (works everywhere, including GitLab) open automated PRs that bump vulnerable packages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; pair scanning with automated updates. Trivy tells you you're exposed; Renovate is what closes the gap before it rots into a quarter-long upgrade project. A scanner with no remediation path just generates anxiety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container image scanning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans the layers of a built Docker image — OS packages and language dependencies baked into it — for known vulnerabilities. The friendly &lt;code&gt;node:18&lt;/code&gt; base image you pulled six months ago has accumulated CVEs since.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; right after you build the image, before you push it to your registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://trivy.dev" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt; again (it does both SCA and image scanning, which is why it's so widely deployed), &lt;a href="https://github.com/anchore/grype" rel="noopener noreferrer"&gt;Grype&lt;/a&gt;, and &lt;a href="https://github.com/quay/clair" rel="noopener noreferrer"&gt;Clair&lt;/a&gt;, which underpins several registries' built-in scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; scan &lt;em&gt;before&lt;/em&gt; the push and gate the push on the result. A concrete GitLab CI pattern:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
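&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;container_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scan&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasec/trivy:latest&lt;/span&gt;
    &lt;span class="c1"&gt;# The image's entrypoint is the trivy binary; clear it so GitLab can run the script in a shell.&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;trivy image --exit-code 1 --severity CRITICAL,HIGH "$IMAGE_TAG"&lt;/span&gt;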
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--exit-code 1&lt;/code&gt; makes the job, and therefore the pipeline, fail on any finding that matches &lt;code&gt;--severity CRITICAL,HIGH&lt;/code&gt;, so an image with a critical or high-severity vulnerability never reaches the registry in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code scanning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans your Terraform, Ansible, CloudFormation, and Kubernetes manifests for insecure configuration &lt;em&gt;before&lt;/em&gt; it provisions anything — public S3 buckets, security groups open to &lt;code&gt;0.0.0.0/0&lt;/code&gt;, unencrypted volumes, over-permissive IAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; in the plan/validate stage, before &lt;code&gt;terraform apply&lt;/code&gt; or &lt;code&gt;ansible-playbook&lt;/code&gt; runs against real infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://www.checkov.io" rel="noopener noreferrer"&gt;Checkov&lt;/a&gt; is the broadest — it covers Terraform, Ansible, Kubernetes, and more out of the box. &lt;a href="https://github.com/aquasecurity/tfsec" rel="noopener noreferrer"&gt;tfsec&lt;/a&gt; is Terraform-focused and fast (now converging with Trivy). &lt;a href="https://kics.io" rel="noopener noreferrer"&gt;KICS&lt;/a&gt; covers a wide spread of IaC formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; scan the Terraform &lt;em&gt;plan&lt;/em&gt;, not just the source, so you catch issues in the resolved configuration. In a GitHub Actions step:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov on Terraform plan&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;terraform plan -out=tfplan.binary&lt;/span&gt;
    &lt;span class="s"&gt;terraform show -json tfplan.binary &amp;gt; tfplan.json&lt;/span&gt;
    &lt;span class="s"&gt;checkov -f tfplan.json --quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches the misconfiguration before it becomes a public bucket someone finds for you.&lt;/p&gt;
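&lt;p&gt;To make "scan the plan" concrete, here is a rough Python sketch of the kind of check a scanner runs against the plan JSON. It is a toy, not how Checkov works internally. The sample dictionary is a hand-written, heavily trimmed stand-in for the &lt;code&gt;resource_changes&lt;/code&gt; section of &lt;code&gt;terraform show -json&lt;/code&gt; output, and the resource names in it are made up.&lt;/p&gt;

```python
def open_ingress(plan):
    """Return addresses of security groups with ingress open to the world."""
    flagged = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_security_group":
            continue
        after = (change.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                flagged.append(change["address"])
                break  # one open rule is enough to flag the group
    return flagged


# Hand-written sample in the assumed shape of a Terraform plan (trimmed).
plan = {
    "resource_changes": [
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 22, "to_port": 22,
              "cidr_blocks": ["0.0.0.0/0"]}]}}},
        {"address": "aws_security_group.db", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 5432, "to_port": 5432,
              "cidr_blocks": ["10.0.0.0/16"]}]}}},
    ]
}

print(open_ingress(plan))
```

&lt;p&gt;Working from the plan means values that come from variables or modules have already been resolved, which is the reason for scanning the plan rather than only the source files.&lt;/p&gt;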

&lt;h2&gt;
  
  
  Secrets detection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans commits, history, and CI config for leaked credentials — AWS keys, tokens, private keys, passwords. This is the highest-leverage category for the effort involved, because a single leaked cloud key can cost you the whole environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; everywhere — as a pre-commit hook on the developer's machine &lt;em&gt;and&lt;/em&gt; as a CI job that scans the full diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://github.com/gitleaks/gitleaks" rel="noopener noreferrer"&gt;Gitleaks&lt;/a&gt; and &lt;a href="https://github.com/trufflesecurity/trufflehog" rel="noopener noreferrer"&gt;TruffleHog&lt;/a&gt; are the open-source workhorses. Run both through the &lt;a href="https://pre-commit.com" rel="noopener noreferrer"&gt;pre-commit&lt;/a&gt; framework so secrets get caught &lt;em&gt;before&lt;/em&gt; they ever hit the remote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; defense in depth. The pre-commit hook stops the honest mistake locally; the CI job catches the developer who skipped the hook with &lt;code&gt;--no-verify&lt;/code&gt;. A combined Trivy-plus-Gitleaks GitLab job that fails the pipeline on a critical finding is one of the cheapest, highest-value things you can add this week.&lt;/p&gt;
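&lt;p&gt;As a rough picture of what a secrets scanner does with a diff, here is a deliberately tiny Python sketch. It is not how Gitleaks or TruffleHog are implemented: it uses only two illustrative patterns, the function name is hypothetical, and the key in the sample is the placeholder value from AWS's own documentation.&lt;/p&gt;

```python
import re

# Two illustrative patterns; real scanners ship far more, plus entropy checks.
PATTERNS = {
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private-key-header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_added_lines(diff_text):
    """Return (line_number, rule_name) for each match on an added diff line."""
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        # Only lines the commit adds; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


sample_diff = "\n".join([
    "+++ b/.gitlab-ci.yml",
    "+variables:",
    "+  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE",
    "-  DEBUG: 'true'",
])

findings = scan_added_lines(sample_diff)
print(findings)
```

&lt;p&gt;A CI job built on the same idea would exit non-zero whenever the list is non-empty, which is what turns a finding into a failed pipeline.&lt;/p&gt;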

&lt;h2&gt;
  
  
  Policy-as-code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; lets you encode organizational rules as machine-checkable policy — "no container runs as root," "every image must come from our approved registry," "no resource without a cost-center tag" — and enforces them automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; two places. In CI, to validate manifests before merge. And at the Kubernetes admission layer, to block non-compliant workloads at deploy time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://www.openpolicyagent.org" rel="noopener noreferrer"&gt;OPA&lt;/a&gt; with &lt;a href="https://www.conftest.dev" rel="noopener noreferrer"&gt;Conftest&lt;/a&gt; for testing config files in CI (Terraform plans, Kubernetes YAML, Dockerfiles) against Rego policies. &lt;a href="https://kyverno.io" rel="noopener noreferrer"&gt;Kyverno&lt;/a&gt; for Kubernetes admission control, where policies are written as Kubernetes resources rather than a separate language — a gentler on-ramp for teams already fluent in YAML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; policy-as-code is how you turn "we agreed in a meeting that prod images must be signed" into something the cluster &lt;em&gt;enforces&lt;/em&gt;. Start with one or two high-value policies in audit mode, watch what they'd block, then flip to enforce.&lt;/p&gt;

&lt;p&gt;This is also the layer that defends against artifact tampering. Generate a signature at build time with &lt;a href="https://docs.sigstore.dev/cosign/overview/" rel="noopener noreferrer"&gt;cosign&lt;/a&gt; and gate the registry push — and the cluster admission — behind a valid signature. If the image isn't the one you built and signed, it doesn't deploy.&lt;/p&gt;
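&lt;p&gt;As a rough illustration of the pipeline half of that gate, here is a sketch that refuses to continue unless cosign can verify the image. It assumes a key pair created with &lt;code&gt;cosign generate-key-pair&lt;/code&gt; and a public key file named &lt;code&gt;cosign.pub&lt;/code&gt;. The function name is hypothetical, and admission-time enforcement in the cluster would still be a separate policy.&lt;/p&gt;

```shell
# deploy_if_signed IMAGE: stop unless the image signature verifies.
# Hypothetical helper; assumes cosign is installed and cosign.pub is the
# public half of the key that signed the image at build time.
deploy_if_signed() {
  if cosign verify --key cosign.pub "$1" > /dev/null; then
    echo "signature ok: $1"
  else
    echo "refusing to deploy unsigned or tampered image: $1"
    return 1
  fi
}
```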

&lt;h2&gt;
  
  
  Runtime security monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; watches running workloads for suspicious behavior — a container spawning a shell, an unexpected outbound connection, a write to a sensitive path. This is the backstop for everything your pre-deploy scans missed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; in production, continuously, outside the pipeline — but it closes the DevSecOps loop by feeding real attack signals back to the people who build the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://falco.org" rel="noopener noreferrer"&gt;Falco&lt;/a&gt; is the open-source standard, using eBPF to observe kernel-level syscalls with minimal overhead and alert on rule violations. The broader eBPF tooling ecosystem (Cilium, Tetragon) extends this into network policy and deep observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; runtime security is your evidence that shift-left isn't perfect — and it never is. Treat a Falco alert as both an incident &lt;em&gt;and&lt;/em&gt; a signal to add a check earlier in the pipeline so the same class of thing gets caught before deploy next time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose: team size and infrastructure maturity
&lt;/h2&gt;

&lt;p&gt;You do not need all of the above on day one. Coverage you can't maintain is worse than honest gaps, because it breeds alert fatigue and trains your team to click past warnings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lean startup / small team.&lt;/strong&gt; Pick the three highest-leverage, lowest-friction tools and run them as failing CI jobs: secrets detection (Gitleaks), dependency + image scanning (Trivy), and IaC scanning (Checkov) if you're running Terraform. That's maybe an afternoon of pipeline work and it eliminates the most common ways small teams get owned. Add Dependabot or Renovate so you're patching, not just panicking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mid-size, growing team.&lt;/strong&gt; Layer in SAST in diff mode, DAST against staging, and image signing with cosign. Start introducing policy-as-code in audit mode so you understand your real posture before you enforce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mature org with regulatory pressure.&lt;/strong&gt; Now the full stack earns its keep: enforced policy-as-code at admission, runtime monitoring with Falco, signed-and-verified artifacts end to end, and centralized reporting so security can see across teams. At this scale the integration and dashboards matter as much as the scanners.&lt;/p&gt;

&lt;p&gt;The pattern is consistent: each category, run as a job that &lt;em&gt;fails fast on real risk&lt;/em&gt;, beats a fancy tool that produces a report nobody reads. If you want to go deeper on pipeline patterns, our &lt;a href="https://dev.to/categories/gitlab-cicd/"&gt;GitLab CI/CD prompts&lt;/a&gt; cover a lot of this ground.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI fits — assistive, not authoritative
&lt;/h2&gt;

&lt;p&gt;Modern security tooling generates a &lt;em&gt;lot&lt;/em&gt; of output, and this is where AI genuinely earns its place. It's good at the reading and summarizing that wears humans down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triaging scan output.&lt;/strong&gt; Paste a wall of Trivy or Semgrep findings and ask the model to group them, identify which CVEs are actually reachable in your code path, and rank by real exploitability. A list of 200 "HIGH" findings becomes "these 6 matter, here's why."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explaining a finding.&lt;/strong&gt; "What does this Checkov failure mean and what's the minimal Terraform change to fix it?" turns a cryptic policy ID into an actionable diff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drafting remediation.&lt;/strong&gt; Generate the patched dependency version, the hardened Dockerfile, or the corrected security-group block — as a &lt;em&gt;starting point&lt;/em&gt; you review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading pipeline logs.&lt;/strong&gt; Summarizing a failed, noisy CI job to find the one line that actually broke the build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The non-negotiable rule, same as during an incident: &lt;strong&gt;AI summarizes and suggests; a human verifies and applies.&lt;/strong&gt; A model will confidently mislabel a vulnerability's severity or propose a "fix" that doesn't compile. Use it to compress the toil, never to make the final security call. We keep a &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; of these workflows, and the same judgment applies to AI-assisted &lt;a href="https://dev.to/dashboard/code-review/"&gt;infrastructure code review&lt;/a&gt; — assistive acceleration, human sign-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is no single "best" DevSecOps toolchain. The best one is the one your team will actually use &lt;em&gt;consistently&lt;/em&gt;. A perfect scanner that everyone learns to ignore protects nothing — coverage you don't act on is worthless. So start small: pick the tools that fit your pipeline, wire them in as jobs that fail fast on genuine risk, and earn the right to add the next layer. Secrets detection, dependency and image scanning, IaC checks, and signed artifacts will stop the overwhelming majority of how teams actually get compromised. Get those running and reliable first, keep the human in the loop on the AI-assisted parts, and grow the stack as your infrastructure — and your team's appetite — matures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Security scan output and AI-generated remediations are assistive, not authoritative. Always verify findings and fixes against your own systems before applying them in production.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/best-devsecops-security-tools-cicd-pipeline-protection/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>devsecops</category>
      <category>cicd</category>
      <category>security</category>
    </item>
    <item>
      <title>Humanizing Artificial Intelligence for Log Analysis: Turning Raw Server Logs Into Clear DevOps Answers</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sat, 27 Jun 2026 19:21:04 +0000</pubDate>
      <link>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-for-log-analysis-turning-raw-server-logs-into-clear-devops-2dnm</link>
      <guid>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-for-log-analysis-turning-raw-server-logs-into-clear-devops-2dnm</guid>
      <description>&lt;p&gt;It's 2:14 a.m. and my phone is buzzing because a customer's instance won't get a floating IP. The alert is one line. The truth is somewhere in about forty thousand lines spread across &lt;code&gt;nova-compute&lt;/code&gt;, &lt;code&gt;neutron-server&lt;/code&gt;, the OVS agent, and &lt;code&gt;libvirtd&lt;/code&gt; — each with its own timestamp format, its own idea of what a "request" is, and its own favorite way of burying the actual error under a wall of stack traces. This is the part of the job nobody puts on a slide. You are not solving a hard problem yet. You are &lt;em&gt;finding&lt;/em&gt; the problem, and finding it is grep, scroll, swear, repeat.&lt;/p&gt;

&lt;p&gt;This is exactly where AI earns its keep — and exactly where most people misuse it. So let me be precise about what I mean by "humanizing AI" for log analysis, because the phrase has been beaten half to death by marketing.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI reads. You decide.
&lt;/h2&gt;

&lt;p&gt;Humanizing AI does not mean an autonomous bot that "handles the incident." It means using a language model for the one thing it is genuinely, freakishly good at: pattern-matching and correlating across huge volumes of text, and translating jargon into plain English. A model can read forty thousand lines faster than you can scroll one screen, notice that the &lt;code&gt;req-&lt;/code&gt; ID in &lt;code&gt;nova-compute&lt;/code&gt; shows up again in &lt;code&gt;neutron-server&lt;/code&gt; 1.2 seconds later attached to a binding failure, and tell you that in a sentence.&lt;/p&gt;

&lt;p&gt;What it must not do is pull the trigger. The model reads; you decide. It hands you ranked hypotheses and a command to &lt;em&gt;verify&lt;/em&gt; them — never a "fix" it wants to apply on its own. You are still the engineer on the hook when the change goes sideways, so you stay the final decision-maker. Every workflow below is built around that rule. I wrote a fuller treatment of this on my site at &lt;a href="https://devopsaitoolkit.com/blog/humanizing-artificial-intelligence-in-log-analysis/" rel="noopener noreferrer"&gt;devopsaitoolkit.com/blog/humanizing-artificial-intelligence-in-log-analysis&lt;/a&gt;, but the short version is: the human stays in the loop, and the loop is where the judgment lives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before anything else: redact
&lt;/h2&gt;

&lt;p&gt;You are about to paste production logs into a model. Stop. Logs leak. They carry bearer tokens, Keystone auth tokens, DB connection strings, private IPs, customer email addresses, and the occasional password somebody logged "temporarily" in 2021. Treat every log line as hostile until you have stripped it.&lt;/p&gt;

&lt;p&gt;I keep a redaction pass that runs &lt;em&gt;before&lt;/em&gt; anything leaves the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; nova-compute &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"10 min ago"&lt;/span&gt; &lt;span class="nt"&gt;--no-pager&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/(password|passwd|secret|token|api[_-]?key)["'&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;' :=]+[^ ,"]+/\1=REDACTED/gi'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/REDACTED_EMAIL/g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/\b([0-9]{1,3}\.){3}[0-9]{1,3}\b/REDACTED_IP/g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'s/Bearer [A-Za-z0-9._-]+/Bearer REDACTED/g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/nova-redacted.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is not perfect — no regex is — but it catches the obvious offenders, and it forces you to look at what you are about to share. Eyeball the output before you paste it.&lt;/p&gt;
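&lt;p&gt;Eyeballing gets easier with a second pass that hunts for leftovers. This is a sketch, assuming a POSIX shell with &lt;code&gt;grep -E&lt;/code&gt;. The name &lt;code&gt;leak_check&lt;/code&gt; and its three patterns (IPv4 addresses, email addresses, JWT-shaped strings) are illustrative and will miss things.&lt;/p&gt;

```shell
# leak_check FILE: print lines that still look sensitive after redaction.
# Hypothetical helper; the patterns are examples, not a complete list.
leak_check() {
  if grep -nE '([0-9]{1,3}\.){3}[0-9]{1,3}|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}|eyJ[A-Za-z0-9_-]{10,}' "$1"; then
    echo "possible leftovers above: do not paste yet"
    return 1
  fi
  echo "no obvious leftovers in $1"
}
```

&lt;p&gt;Run it as &lt;code&gt;leak_check /tmp/nova-redacted.log&lt;/code&gt;. A clean result only means these three patterns found nothing, so it does not replace reading the file yourself.&lt;/p&gt;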

&lt;p&gt;&lt;em&gt;Pro Tip: Build the redaction step into the same command that pulls the logs, never as a separate afterthought. The moment "redact later" becomes a step you do by hand, you will skip it at 2 a.m. when it matters most. A pipe that always redacts is a habit; a checklist item is a future incident.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Linux logs: journald and syslog
&lt;/h2&gt;

&lt;p&gt;Start where most things start: the host. &lt;code&gt;journalctl&lt;/code&gt; is your front door, and a model is far better at reading its firehose output than you are when you're tired.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-p&lt;/span&gt; err &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"today"&lt;/span&gt; &lt;span class="nt"&gt;--no-pager&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; short-iso &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s/Bearer [A-Za-z0-9._-]+/Bearer REDACTED/g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/host-errors.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mistake people make is pasting raw lines and asking "what's wrong?" That gets you a confident, useless summary. Give the model the &lt;em&gt;shape&lt;/em&gt; of what you want: timeline, correlation, and a verification step. The prompt matters more than the model. I keep a running set of patterns for this in &lt;a href="https://devopsaitoolkit.com/blog/analyzing-journald-logs-with-journalctl-and-ai/" rel="noopener noreferrer"&gt;my journald-with-AI write-up&lt;/a&gt;, but the core ask is always the same: "Group these by service and time, tell me which errors are causes versus symptoms, and give me a command to confirm before I touch anything."&lt;/p&gt;

&lt;p&gt;That last clause is the whole game. A model will happily tell you "restart the OVS agent." A &lt;em&gt;humanized&lt;/em&gt; workflow makes it tell you how to check whether the OVS agent is actually the problem first.&lt;/p&gt;
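&lt;p&gt;One way to stop yourself from skipping that ask is to bake it into the thing you paste. A small sketch, where &lt;code&gt;build_prompt&lt;/code&gt; is a made-up helper and the wording is just one phrasing of the request above:&lt;/p&gt;

```shell
# build_prompt FILE: wrap an already-redacted log in the standing instructions.
# Hypothetical helper; adjust the wording to taste.
build_prompt() {
  printf '%s\n' \
    "Group these log lines by service and time." \
    "Tell me which errors are causes versus symptoms." \
    "Give me a read-only command to confirm before I touch anything." \
    "--- BEGIN LOG: $1 ---"
  cat "$1"
  printf '%s\n' "--- END LOG ---"
}
```

&lt;p&gt;Usage would be something like &lt;code&gt;build_prompt /tmp/host-errors.log &gt; /tmp/prompt.txt&lt;/code&gt;, then paste the result.&lt;/p&gt;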

&lt;h2&gt;
  
  
  Application and container logs: feed the right context
&lt;/h2&gt;

&lt;p&gt;Container logs are where context discipline pays off. A crash-looping pod's &lt;em&gt;current&lt;/em&gt; logs are often the least useful thing you can read — the interesting failure happened in the instance that already died. So you reach for the previous container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs deploy/payments &lt;span class="nt"&gt;-c&lt;/span&gt; api &lt;span class="nt"&gt;--previous&lt;/span&gt; &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;500 &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s/(authorization|cookie):.*/\1: REDACTED/gi'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/payments-prev.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then pair it with the events, because the pod logs rarely tell you &lt;em&gt;why&lt;/em&gt; Kubernetes killed the thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;--field-selector&lt;/span&gt; involvedObject.name&lt;span class="o"&gt;=&lt;/span&gt;payments-7d9f-abc &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.lastTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you hand the model both: the previous container logs &lt;em&gt;and&lt;/em&gt; the events. The events say "OOMKilled" or "readiness probe failed"; the application logs say what the process was doing in its last breath. The model's job is to connect those two stories. On its own, neither one is conclusive. Together they usually are — and if you only feed one, the model will hallucinate the other half. Garbage context in, confident nonsense out.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: When you share container logs with a model, always say which container and whether it's &lt;code&gt;--previous&lt;/code&gt; or current. "Here are the logs" is a trap — the model can't see your kubectl flags, and a restart-loop's current logs versus its previous logs tell completely different stories. Label your evidence.&lt;/em&gt;&lt;/p&gt;
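&lt;p&gt;Labeling can be as simple as a header line per file. A sketch, with a hypothetical &lt;code&gt;label_evidence&lt;/code&gt; helper and an arbitrary header format:&lt;/p&gt;

```shell
# label_evidence LABEL FILE: print FILE under a header that names its source.
# Hypothetical helper; the header format is arbitrary.
label_evidence() {
  printf '=== %s ===\n' "$1"
  cat "$2"
  printf '\n'
}
```

&lt;p&gt;For example, &lt;code&gt;label_evidence "payments/api, --previous, last 500 lines" /tmp/payments-prev.log&lt;/code&gt; followed by a second call for a saved copy of the events output gives the model both stories with their sources attached.&lt;/p&gt;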

&lt;p&gt;If your container logs live in Loki rather than &lt;code&gt;kubectl&lt;/code&gt;, the same principle holds — you're just pulling from LogQL instead. I walked through that whole flow, including how to keep the model honest against a Loki backend, in &lt;a href="https://devopsaitoolkit.com/blog/reading-loki-logs-with-ai/" rel="noopener noreferrer"&gt;reading Loki logs with AI&lt;/a&gt;. The pattern doesn't change with the backend; only the query language does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kubernetes events: the cheap, ignored goldmine
&lt;/h2&gt;

&lt;p&gt;Events are the most under-read signal in a cluster. They expire, they're noisy, and they're written in half-jargon that's easy to skim past. That's precisely the kind of text a model parses well. Dump them wide and let it cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;-A&lt;/span&gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.lastTimestamp &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-Ev&lt;/span&gt; &lt;span class="s2"&gt;"Normal&lt;/span&gt;&lt;span class="se"&gt;\s&lt;/span&gt;&lt;span class="s2"&gt;+(Pulled|Created|Started|Scheduled)"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/events.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ask the model to bucket these by root cause, not by namespace — "show me which events are likely the same underlying problem reported by different controllers." A human scanning this sees a thousand lines. The model sees three problems wearing a thousand costumes. You still decide which of the three is worth waking someone up for. If you want a deeper dive on the cluster-side patterns, I keep my Kubernetes material at &lt;a href="https://devopsaitoolkit.com/categories/kubernetes-helm/" rel="noopener noreferrer"&gt;devopsaitoolkit.com/categories/kubernetes-helm&lt;/a&gt;.&lt;/p&gt;
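&lt;p&gt;Before handing the dump over, a quick count per type and reason gives you a sanity check on whatever clusters the model comes back with. This sketch assumes the default &lt;code&gt;kubectl get events -A&lt;/code&gt; column layout (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); &lt;code&gt;count_reasons&lt;/code&gt; is a made-up name.&lt;/p&gt;

```shell
# count_reasons: read "kubectl get events -A" text on stdin and count
# events per TYPE/REASON pair, most frequent first.
# Assumes default columns, so TYPE is field 3 and REASON is field 4.
count_reasons() {
  awk '$3 == "Warning" || $3 == "Normal" { n[$3 " " $4]++ }
       END { for (k in n) print n[k], k }' | sort -rn
}
```

&lt;p&gt;Usage: &lt;code&gt;count_reasons&lt;/code&gt; reads stdin, so pipe &lt;code&gt;/tmp/events.log&lt;/code&gt; into it. If the model reports three root causes and the counts show one reason dominating, that is a useful cross-check.&lt;/p&gt;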

&lt;h2&gt;
  
  
  OpenStack: tracing one request across nova, neutron, and libvirt
&lt;/h2&gt;

&lt;p&gt;Here's the 2 a.m. floating-IP problem, and it's the best example of why correlation beats reading. In OpenStack, a single API call fans out across services, and the &lt;em&gt;only&lt;/em&gt; thread tying them together is the request ID. Find it, then chase it everywhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;REQ&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"req-8f2c1a4e-..."&lt;/span&gt;   &lt;span class="c"&gt;# pulled from the nova-api log for the failed call&lt;/span&gt;

&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$REQ&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /var/log/nova/nova-compute.log &lt;span class="se"&gt;\&lt;/span&gt;
  /var/log/neutron/neutron-server.log &lt;span class="se"&gt;\&lt;/span&gt;
  /var/log/neutron/neutron-openvswitch-agent.log &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'s/\b([0-9]{1,3}\.){3}[0-9]{1,3}\b/REDACTED_IP/g'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/req-trace.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
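&lt;p&gt;If you don't have the request ID yet, one way to find candidates is to list the IDs that appear on error lines. This is a sketch under two assumptions: the log uses the usual layout where the level and the &lt;code&gt;req-&lt;/code&gt; ID share a line, and the API log lives at a path like &lt;code&gt;/var/log/nova/nova-api.log&lt;/code&gt;. The function name is hypothetical.&lt;/p&gt;

```shell
# failed_request_ids FILE: list request IDs that appear on ERROR lines.
# Hypothetical helper; a request ID is "req-" followed by a UUID.
failed_request_ids() {
  grep ' ERROR ' "$1" \
    | grep -oE 'req-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
    | sort -u
}
```

&lt;p&gt;Pick the ID whose timestamp matches the failed call and assign it to &lt;code&gt;REQ&lt;/code&gt;.&lt;/p&gt;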



&lt;p&gt;For the libvirt side, the request ID won't follow you — &lt;code&gt;libvirtd&lt;/code&gt; and the per-instance QEMU logs key off the instance UUID and domain name instead, so you grab those by time window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; libvirtd &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"02:10"&lt;/span&gt; &lt;span class="nt"&gt;--until&lt;/span&gt; &lt;span class="s2"&gt;"02:16"&lt;/span&gt; &lt;span class="nt"&gt;--no-pager&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/libvirt-window.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you've got a stitched-together story across four logs in three formats. A human can read this. It just takes twenty minutes of cross-referencing timestamps while a customer waits. The model does the cross-referencing in seconds and — crucially — tells you which service &lt;em&gt;first&lt;/em&gt; reported a real failure versus which ones are just downstream noise reacting to it. That causal ordering is the hard part, and it's exactly what correlation-across-text is built for.&lt;/p&gt;

&lt;p&gt;Here's what a genuinely &lt;em&gt;humanized&lt;/em&gt; output looks like. Not a fix. Ranked hypotheses, confidence, and the next command to run:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Most likely (high confidence):&lt;/strong&gt; The &lt;code&gt;vif-plugged&lt;/code&gt; event from the OVS agent never reached &lt;code&gt;nova-compute&lt;/code&gt; within the timeout. &lt;code&gt;neutron-server&lt;/code&gt; logged the port as &lt;code&gt;BUILD&lt;/code&gt; at 02:11:43 but never transitioned to &lt;code&gt;ACTIVE&lt;/code&gt;. nova then timed out waiting for the network and rolled the instance back — so the floating-IP failure is a &lt;em&gt;symptom&lt;/em&gt;, not the cause.&lt;br&gt;
&lt;strong&gt;Verify before acting:&lt;/strong&gt; &lt;code&gt;openstack port show &amp;lt;PORT_ID&amp;gt; -f value -c status -c binding_vif_type&lt;/code&gt;. If &lt;code&gt;status&lt;/code&gt; is &lt;code&gt;DOWN&lt;/code&gt; or &lt;code&gt;binding_vif_type&lt;/code&gt; is &lt;code&gt;binding_failed&lt;/code&gt;, this hypothesis holds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Less likely (medium confidence):&lt;/strong&gt; OVS agent lost its connection to &lt;code&gt;neutron-server&lt;/code&gt; (RabbitMQ hiccup). Check with &lt;code&gt;journalctl -u neutron-openvswitch-agent --since 02:09 | grep -i "agent.*down\|AMQP"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlikely (low confidence):&lt;/strong&gt; libvirt failed to attach the tap device. Only pursue if the port shows &lt;code&gt;ACTIVE&lt;/code&gt; but the domain XML is missing the interface.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read that again, because it models the behavior you want. It commits to an ordering. It separates cause from symptom. And every single hypothesis ends in a &lt;em&gt;command you run&lt;/em&gt; to confirm or kill it — not an action the model took. You are still the one who runs &lt;code&gt;openstack port show&lt;/code&gt; and reads the result with your own eyes. If you want more OpenStack-specific debugging patterns, that's a whole category on my site at &lt;a href="https://devopsaitoolkit.com/categories/openstack/" rel="noopener noreferrer"&gt;devopsaitoolkit.com/categories/openstack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Demand the model rank by confidence AND mark each hypothesis as cause or symptom. The single most expensive mistake in an incident is chasing a loud symptom while the quiet root cause keeps burning. Forcing that label turns the model from a search engine into a triage partner — but you still own the triage.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make the model show its work
&lt;/h2&gt;

&lt;p&gt;The difference between a useful AI log workflow and a dangerous one comes down to one demand: &lt;strong&gt;show your reasoning and give me a command to verify.&lt;/strong&gt; A model that says "it's the OVS agent, restart it" is a liability. A model that says "I think it's the OVS agent &lt;em&gt;because&lt;/em&gt; these three log lines, and here's the &lt;code&gt;openstack port show&lt;/code&gt; to confirm" is a colleague.&lt;/p&gt;

&lt;p&gt;Build that demand into your prompt every time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quote the specific log lines you're reasoning from.&lt;/li&gt;
&lt;li&gt;Rank hypotheses by confidence and label each as cause or symptom.&lt;/li&gt;
&lt;li&gt;For each, give a read-only command that confirms or refutes it.&lt;/li&gt;
&lt;li&gt;Never output a mutating command as a recommendation — only diagnostics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last rule keeps the human firmly in control. The model can suggest you &lt;em&gt;look&lt;/em&gt; at something. It does not get to suggest you &lt;em&gt;change&lt;/em&gt; something, because the change is your call, with your context, and your name on the change ticket. AI reads; you decide. If you want to see this wired up as an actual assistant rather than a copy-paste loop, I run a free one you can poke at over at &lt;a href="https://devopsaitoolkit.com/dashboard/incident-response/" rel="noopener noreferrer"&gt;devopsaitoolkit.com/dashboard/incident-response&lt;/a&gt;.&lt;/p&gt;
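&lt;p&gt;If you script any part of this loop, a coarse guard on suggested commands is cheap to add. The sketch below is a hypothetical denylist, not a security control: it only checks for a handful of obviously mutating verbs, and a clean result still needs your eyes.&lt;/p&gt;

```shell
# is_read_only "COMMAND": succeed only if COMMAND contains none of the
# mutating verbs below. Hypothetical, deliberately coarse denylist.
is_read_only() {
  case " $1 " in
    *" restart "*|*" stop "*|*" start "*|*" delete "*|*" apply "*|*" set "*|*" rm "*|*" reboot "*)
      return 1 ;;
  esac
  return 0
}
```

&lt;p&gt;With this, &lt;code&gt;is_read_only "openstack port show PORT_ID"&lt;/code&gt; succeeds, while anything containing &lt;code&gt;restart&lt;/code&gt; or &lt;code&gt;delete&lt;/code&gt; as a separate word is rejected.&lt;/p&gt;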

&lt;h2&gt;
  
  
  The part that's actually yours
&lt;/h2&gt;

&lt;p&gt;Strip away the tooling and here's what's left. The model is a phenomenal reader and translator. It collapses forty thousand lines into three hypotheses, turns OpenStack's internal dialect into a sentence a sleep-deprived human can act on, and never gets bored on line thirty-nine thousand. That is real leverage, and refusing to use it out of purism is just making your nights longer.&lt;/p&gt;

&lt;p&gt;But the decision is still yours. The model doesn't know that this customer is mid-migration and a "harmless" agent restart will nuke their in-flight transfer. It doesn't know your change-freeze window, your blast radius, or that the "obvious" fix burned you last quarter. That context lives in your head, and that's the irreducibly human part of the job. Humanizing AI means letting it do the reading so you have the energy left to do the deciding.&lt;/p&gt;

&lt;p&gt;So redact your logs, feed the right context, demand the reasoning and the verification command — and then &lt;em&gt;you&lt;/em&gt; run the command. That's the loop. That's the whole thing.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;James Joyner IV runs &lt;a href="https://devopsaitoolkit.com/work-with-me/" rel="noopener noreferrer"&gt;devopsaitoolkit.com&lt;/a&gt;, where he writes about running production OpenStack, Kubernetes, and observability without losing his mind. Try the free AI Incident Response Assistant for ranked, verify-first log triage, and if you write about this stuff too, the &lt;a href="https://devopsaitoolkit.com/prompt-packs/writing-humanizer/" rel="noopener noreferrer"&gt;Writing Humanizer pack&lt;/a&gt; keeps your prose sounding like you.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>observability</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DevOps Security Best Practices Every Engineering Team Should Follow</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Fri, 26 Jun 2026 14:01:11 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/devops-security-best-practices-every-engineering-team-should-follow-18k5</link>
      <guid>https://dev.to/devopsaitoolkit/devops-security-best-practices-every-engineering-team-should-follow-18k5</guid>
      <description>&lt;p&gt;I've spent 25 years securing Linux boxes, cloud accounts, CI/CD pipelines, and production clusters. The single most consistent lesson across all of it is this: the teams that get breached aren't the ones who lacked a security department. They're the ones who treated security as something a separate department would handle &lt;em&gt;later&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Security is not a phase. It's not a gate at the end of the pipeline, and it's not a quarterly audit. It's a property of how you write infrastructure code, manage secrets, ship containers, and run production every single day. When security lives inside the daily workflow — in the merge request, the pipeline stage, the Terraform plan — it costs almost nothing. When it lives in a separate review at the end, it's expensive, late, and routinely skipped.&lt;/p&gt;

&lt;p&gt;This is the checklist I'd hand a new engineering team. Everything here is defensive: hardening, detection, and recovery. Work through it section by section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DevOps security belongs in the daily workflow
&lt;/h2&gt;

&lt;p&gt;The whole premise of DevOps was to stop throwing work over the wall between dev and ops. Security is the last wall standing in most orgs, and it has to come down the same way: by moving the controls into the tools engineers already use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat every pull/merge request as a security review surface, not just a code review.&lt;/li&gt;
&lt;li&gt;Run security checks as pipeline stages that &lt;em&gt;fail the build&lt;/em&gt;, not as advisory reports nobody reads.&lt;/li&gt;
&lt;li&gt;Make the secure path the easy path — a hardened base image, a vetted Terraform module, a secrets helper — so engineers don't route around it.&lt;/li&gt;
&lt;li&gt;Assign a security owner per service, not a security team for the whole company. Ownership beats oversight.&lt;/li&gt;
&lt;li&gt;Measure mean-time-to-remediate for vulnerabilities the same way you measure deploy frequency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a control only exists in a wiki page, it doesn't exist. If it exists in the pipeline, it's real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure access control and least privilege
&lt;/h2&gt;

&lt;p&gt;Most incidents I've cleaned up came down to one over-privileged credential. Least privilege is boring and it's the highest-leverage thing on this list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default every IAM role, Kubernetes ServiceAccount, and Linux user to &lt;em&gt;zero&lt;/em&gt; permissions, then add only what's needed.&lt;/li&gt;
&lt;li&gt;Replace standing admin access with just-in-time elevation that expires automatically.&lt;/li&gt;
&lt;li&gt;Scope cloud roles to specific resources and actions — no &lt;code&gt;*:*&lt;/code&gt; policies, ever.&lt;/li&gt;
&lt;li&gt;In Kubernetes, use namespaced &lt;code&gt;Role&lt;/code&gt; and &lt;code&gt;RoleBinding&lt;/code&gt; objects rather than &lt;code&gt;ClusterRoleBinding&lt;/code&gt;s wherever possible.&lt;/li&gt;
&lt;li&gt;Separate human identities from machine identities. Humans get SSO; services get workload identity.&lt;/li&gt;
&lt;li&gt;Audit who can &lt;code&gt;sudo&lt;/code&gt;, who's in the &lt;code&gt;docker&lt;/code&gt; group (that's root-equivalent), and who holds cloud admin — quarterly, in writing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  SSH key management and MFA
&lt;/h2&gt;

&lt;p&gt;SSH is still how a huge amount of production gets touched, and it's still where credential hygiene quietly rots.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disable password authentication and direct root login: &lt;code&gt;PasswordAuthentication no&lt;/code&gt; and &lt;code&gt;PermitRootLogin no&lt;/code&gt; in &lt;code&gt;sshd_config&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use per-user keys, never a shared key passed around in a chat thread.&lt;/li&gt;
&lt;li&gt;Prefer short-lived SSH certificates from a CA over long-lived static keys; rotate the rest on a schedule.&lt;/li&gt;
&lt;li&gt;Put a bastion/jump host in front of production and log every session through it.&lt;/li&gt;
&lt;li&gt;Require MFA on every identity provider, VPN, and cloud console — phishing-resistant (WebAuthn/hardware keys) for anyone with production access.&lt;/li&gt;
&lt;li&gt;Pull keys for departed team members the same day, and audit &lt;code&gt;authorized_keys&lt;/code&gt; files for orphans.&lt;/li&gt;
&lt;/ul&gt;
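
&lt;p&gt;To make the &lt;code&gt;sshd_config&lt;/code&gt; settings above checkable, here's a small Python sketch that reports which of the two directives are missing or set to something other than &lt;code&gt;no&lt;/code&gt;. The function name and sample text are illustrative. The parser is deliberately naive: it reads only global settings, stops at the first &lt;code&gt;Match&lt;/code&gt; block, and does not follow &lt;code&gt;Include&lt;/code&gt; files.&lt;/p&gt;

```python
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def audit_sshd_config(text):
    """Map each required keyword to the non-compliant value found (None if unset)."""
    seen = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        if key == "match":
            break  # stop at the first Match block; only global settings are checked
        seen.setdefault(key, value)  # sshd keeps the first value it reads
    return {k: seen.get(k) for k, want in REQUIRED.items() if seen.get(k) != want}


# Hypothetical config: password auth is off, but root can still log in with a key.
sample = """
# hardened example
PasswordAuthentication no
PermitRootLogin prohibit-password
"""
print(audit_sshd_config(sample))
```

&lt;p&gt;For a real host, &lt;code&gt;sshd -T&lt;/code&gt; prints the effective configuration after includes and defaults are resolved, which is a more reliable input than the raw file.&lt;/p&gt;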

&lt;h2&gt;
  
  
  Secrets management: API keys, passwords, and tokens
&lt;/h2&gt;

&lt;p&gt;The fastest way to leak a secret is to commit it. The second fastest is to print it. Both are entirely preventable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never store secrets in git — not in code, not in &lt;code&gt;.env&lt;/code&gt;, not in a "temporary" YAML file. Add a pre-commit secret scanner (gitleaks or trufflehog) to block it.&lt;/li&gt;
&lt;li&gt;Centralize secrets in a real secrets manager: HashiCorp Vault, a cloud secrets manager, or equivalent.&lt;/li&gt;
&lt;li&gt;For Kubernetes, use Sealed Secrets (so only encrypted manifests ever land in git) or an external-secrets operator (so the cluster pulls values from Vault or a cloud secrets manager at runtime). Plain &lt;code&gt;Secret&lt;/code&gt; manifests are only base64-encoded, not encrypted.&lt;/li&gt;
&lt;li&gt;Give every secret a rotation policy and an owner. Static credentials that never rotate are time bombs.&lt;/li&gt;
&lt;li&gt;Inject secrets as runtime environment values or mounted files, not baked into container images or Terraform state.&lt;/li&gt;
&lt;li&gt;Scan your git history, not just the current tree. A secret deleted in HEAD is still readable in the history, so treat it as leaked and rotate it; deleting the line does not undo the exposure.&lt;/li&gt;
&lt;/ul&gt;
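
&lt;p&gt;Dedicated scanners like gitleaks and trufflehog are the right tools here, but the core idea is simple enough to sketch. The Python below looks for two well-known shapes, an AWS access key ID and a PEM private key header, in any text you hand it. The pattern set and function name are illustrative, and the sample uses the placeholder key from AWS's own documentation.&lt;/p&gt;

```python
import re

# Two illustrative patterns; real scanners ship hundreds plus entropy checks.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text):
    """Return sorted (line_number, pattern_name) pairs for every suspicious line."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return sorted(hits)


sample = (
    'region = "us-east-1"\n'
    'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
    "-----BEGIN RSA PRIVATE KEY-----\n"
)
print(scan_text(sample))
```

&lt;p&gt;Feeding it the output of &lt;code&gt;git log -p --all&lt;/code&gt; instead of a single file is one way to cover history as well as the current tree.&lt;/p&gt;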

&lt;h2&gt;
  
  
  CI/CD pipeline security
&lt;/h2&gt;

&lt;p&gt;Your pipeline has credentials to everything. That makes it one of the highest-value targets you own, and it's frequently the least hardened.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Protect your main branches: require reviews, status checks, and signed commits before merge.&lt;/li&gt;
&lt;li&gt;Mark CI/CD variables as &lt;strong&gt;protected&lt;/strong&gt; and &lt;strong&gt;masked&lt;/strong&gt; so they're only exposed on protected branches and never echoed to logs.&lt;/li&gt;
&lt;li&gt;In GitLab CI, scope variables to environments and never &lt;code&gt;echo&lt;/code&gt; a secret — masking helps, but the discipline of not printing it is what saves you.&lt;/li&gt;
&lt;li&gt;Replace long-lived cloud keys in CI with short-lived credentials via OIDC. Let the pipeline exchange its identity for a temporary, scoped token instead of holding a static &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pin and review your pipeline dependencies — third-party CI templates and actions run with your pipeline's privileges.&lt;/li&gt;
&lt;li&gt;Require manual approval for production deploys, and make the deploy job itself least-privileged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A leaked CI variable is a leaked production credential. Treat the pipeline config with the same care you'd treat root.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container image scanning
&lt;/h2&gt;

&lt;p&gt;A container is only as trustworthy as the layers underneath it. Most images ship with known CVEs the team never looked at.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan every image with Trivy (or Grype) as a GitLab pipeline stage &lt;strong&gt;before push&lt;/strong&gt;, and fail the build on high/critical findings:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
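&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;container_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Pin a specific version or digest in real pipelines.&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasec/trivy:latest&lt;/span&gt;
      &lt;span class="c1"&gt;# The image's entrypoint is trivy itself; clear it so GitLab can run the script.&lt;/span&gt;
      &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE_TAG"&lt;/span&gt;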
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Start from minimal base images (distroless or slim) to shrink the attack surface.&lt;/li&gt;
&lt;li&gt;Run containers as a non-root user (&lt;code&gt;USER&lt;/code&gt; in the Dockerfile) with a read-only root filesystem where possible.&lt;/li&gt;
&lt;li&gt;Drop all Linux capabilities and add back only what's required.&lt;/li&gt;
&lt;li&gt;Pin base images by digest, not by floating &lt;code&gt;:latest&lt;/code&gt; tags, and rebuild regularly to pick up patches.&lt;/li&gt;
&lt;li&gt;Sign images and verify signatures at admission so only your builds run in your cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Infrastructure as Code security
&lt;/h2&gt;

&lt;p&gt;IaC is where a one-line mistake becomes a fleet-wide misconfiguration. The good news: it's also where automated policy catches it before it ships.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review Terraform and Ansible changes like application code — every change goes through a merge request with a human reviewer.&lt;/li&gt;
&lt;li&gt;Run static analysis on IaC in the pipeline: &lt;code&gt;tfsec&lt;/code&gt;/Checkov for Terraform, &lt;code&gt;ansible-lint&lt;/code&gt; and &lt;code&gt;kube-linter&lt;/code&gt; for the rest.&lt;/li&gt;
&lt;li&gt;Adopt policy-as-code (OPA/Conftest or Sentinel) so rules like "no public S3 buckets" and "no &lt;code&gt;0.0.0.0/0&lt;/code&gt; on port 22" are enforced automatically, not remembered by reviewers.&lt;/li&gt;
&lt;li&gt;Protect and encrypt Terraform state, because it can hold secrets in plaintext. Use a remote backend with locking and access controls.&lt;/li&gt;
&lt;li&gt;For Ansible, encrypt sensitive variables with Ansible Vault and avoid &lt;code&gt;become&lt;/code&gt; where it isn't needed.&lt;/li&gt;
&lt;li&gt;Diff the &lt;code&gt;plan&lt;/code&gt; before every &lt;code&gt;apply&lt;/code&gt; and require approval for changes to security groups, IAM, and networking.&lt;/li&gt;
&lt;/ul&gt;
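
&lt;p&gt;The "no &lt;code&gt;0.0.0.0/0&lt;/code&gt; on port 22" rule is a good example of what policy-as-code automates. In practice you'd write it in OPA/Conftest or Sentinel; the plain-Python sketch below only stands in for that to show the logic. The rule dictionaries use a simplified, hypothetical shape (&lt;code&gt;from_port&lt;/code&gt;, &lt;code&gt;to_port&lt;/code&gt;, &lt;code&gt;cidr_blocks&lt;/code&gt;), not the full Terraform plan format, and protocol is not checked.&lt;/p&gt;

```python
WORLD = {"0.0.0.0/0", "::/0"}


def world_open_ssh(ingress_rules):
    """Return the rules that expose port 22 to any address."""
    exposed = []
    for rule in ingress_rules:
        covers_ssh = 22 in range(rule["from_port"], rule["to_port"] + 1)
        open_to_world = bool(WORLD.intersection(rule["cidr_blocks"]))
        if covers_ssh and open_to_world:
            exposed.append(rule)
    return exposed


# Hypothetical ingress rules extracted from a planned change.
rules = [
    {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
    {"from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"]},
    {"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
]
violations = world_open_ssh(rules)
print(len(violations))
```

&lt;p&gt;Only the first rule is a violation: SSH from a private range and HTTPS from anywhere both pass. Note that the range check also catches wide port ranges that happen to include 22.&lt;/p&gt;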

&lt;p&gt;If you want a structured second opinion on a risky module, an automated &lt;a href="https://dev.to/dashboard/code-review/"&gt;infrastructure code review&lt;/a&gt; catches the misconfigurations a tired reviewer skims past at the end of the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patch management and vulnerability remediation
&lt;/h2&gt;

&lt;p&gt;Unpatched systems are among the most common root causes of breaches, and among the least glamorous to fix. Make patching routine so it isn't a decision.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate OS patching with unattended security updates on Linux, and rebuild container images on a cadence rather than letting them age.&lt;/li&gt;
&lt;li&gt;Track your dependencies with SBOMs so you can answer "are we affected?" the day a CVE drops.&lt;/li&gt;
&lt;li&gt;Subscribe to advisories for your stack and define an SLA: criticals patched in days, highs in a week or two.&lt;/li&gt;
&lt;li&gt;Use Dependabot/Renovate to open dependency-bump PRs automatically and run them through your test suite.&lt;/li&gt;
&lt;li&gt;Keep an inventory of every host, image, and cluster version — you can't patch what you don't know you run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Monitoring, logging, and alerting for security events
&lt;/h2&gt;

&lt;p&gt;You cannot respond to what you can't see. Detection is the difference between a contained incident and a postmortem that starts with "we think they were in for three months."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable Linux &lt;code&gt;auditd&lt;/code&gt; and ship &lt;code&gt;/var/log/auth.log&lt;/code&gt;, sudo events, and SSH activity to centralized, append-only storage.&lt;/li&gt;
&lt;li&gt;Export security metrics to Prometheus and build Grafana dashboards plus alerts for anomalies: failed-login spikes, new sudo grants, unexpected outbound connections, root logins, container escapes.&lt;/li&gt;
&lt;li&gt;Alert on auditd/SSH anomalies in Grafana: a burst of failed SSH logins from a new ASN, or a successful root login outside business hours, should page someone.&lt;/li&gt;
&lt;li&gt;Turn on cloud audit logging (CloudTrail or equivalent) and alert on IAM policy changes, new access keys, and security-group edits.&lt;/li&gt;
&lt;li&gt;Capture Kubernetes audit logs and alert on &lt;code&gt;exec&lt;/code&gt; into production pods and changes to RBAC.&lt;/li&gt;
&lt;li&gt;Keep logs immutable and retained long enough to investigate a slow-burn intrusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Backup and disaster recovery planning
&lt;/h2&gt;

&lt;p&gt;Ransomware and fat-fingered &lt;code&gt;terraform destroy&lt;/code&gt; have the same fix: backups you've actually tested. Untested backups are just hope.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow 3-2-1: three copies, two media types, one off-site and offline.&lt;/li&gt;
&lt;li&gt;Keep at least one immutable/air-gapped copy that a compromised admin credential can't delete.&lt;/li&gt;
&lt;li&gt;Encrypt backups at rest and control who can read and restore them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test restores on a schedule.&lt;/strong&gt; A backup you've never restored is a guess, not a recovery plan.&lt;/li&gt;
&lt;li&gt;Document RPO and RTO per service and verify your backup cadence actually meets them.&lt;/li&gt;
&lt;li&gt;Back up the things people forget: Terraform state, Vault data, etcd, and database credentials.&lt;/li&gt;
&lt;/ul&gt;
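
&lt;p&gt;Checking backup cadence against RPO is simple arithmetic, which makes it easy to automate. Here's a minimal Python sketch that flags services whose backup interval is longer than the data loss they say they can tolerate. The service names and numbers are invented, and it treats the backup interval as the worst-case loss, ignoring how long the backup itself takes to run.&lt;/p&gt;

```python
from datetime import timedelta


def rpo_gaps(services):
    """Return the services whose backup interval is longer than their RPO."""
    return sorted(
        name
        for name, spec in services.items()
        if spec["backup_interval"] > spec["rpo"]
    )


# Hypothetical inventory: documented RPO next to the actual backup schedule.
services = {
    "orders-db": {"rpo": timedelta(minutes=15), "backup_interval": timedelta(hours=24)},
    "wiki": {"rpo": timedelta(hours=24), "backup_interval": timedelta(hours=24)},
}
print(rpo_gaps(services))
```

&lt;p&gt;Here &lt;code&gt;orders-db&lt;/code&gt; promises a 15-minute RPO on a nightly backup, which is exactly the kind of gap that stays invisible until a restore is needed.&lt;/p&gt;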

&lt;h2&gt;
  
  
  Incident response preparation
&lt;/h2&gt;

&lt;p&gt;The middle of an incident is the worst time to figure out your incident process. Prepare while it's calm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write a runbook: who's on call, how to declare an incident, where to communicate, and how to reach the right people fast.&lt;/li&gt;
&lt;li&gt;Pre-define severity levels and the actions each triggers.&lt;/li&gt;
&lt;li&gt;Keep break-glass credentials in a sealed, audited path — available in a crisis, logged when used.&lt;/li&gt;
&lt;li&gt;Practice. Run a tabletop or game day at least quarterly so the steps are muscle memory.&lt;/li&gt;
&lt;li&gt;Have communication templates ready — customer-facing and internal — so comms don't stall the investigation.&lt;/li&gt;
&lt;li&gt;Draft blameless postmortems while the timeline is fresh, and turn action items into tracked work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI-assisted security checks — review, never blind trust
&lt;/h2&gt;

&lt;p&gt;AI is genuinely useful for security work: it reads more YAML, Terraform, and logs than you can, and it's fast at spotting the misconfiguration buried in a 400-line diff. The rule is the same one I apply to everything an AI generates — &lt;strong&gt;it drafts, you decide.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI to review IaC and pipeline configs for missing controls, over-broad permissions, and risky defaults — as a first-pass reviewer, not the final word.&lt;/li&gt;
&lt;li&gt;Have it summarize scanner output and rank findings by real-world exploitability so your team fixes what matters first.&lt;/li&gt;
&lt;li&gt;Let it draft hardening changes and policy rules, then read every line before you apply it — confident and correct are not the same thing.&lt;/li&gt;
&lt;li&gt;Never paste live secrets, real hostnames, or customer data into a model. Scrub first.&lt;/li&gt;
&lt;li&gt;Keep a human approving every change that touches IAM, networking, or production. AI accelerates the work; it doesn't own the risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want vetted starting points, our &lt;a href="https://dev.to/categories/security-hardening/"&gt;security &amp;amp; hardening prompts&lt;/a&gt; cover image scanning, Linux hardening, and IaC review, and the broader &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; has the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;None of these practices are exotic. Least privilege, managed secrets, scanned images, policy-checked infrastructure, patched hosts, real monitoring, tested backups, and a rehearsed incident plan — every one is achievable this quarter, and every one is cheaper to build into the workflow than to bolt on after a breach.&lt;/p&gt;

&lt;p&gt;And here's the part that doesn't show up on the engineering scorecard: secure DevOps is a competitive advantage. Customers run security questionnaires before they sign. Investors ask about your posture in diligence. Partners won't integrate with a platform they don't trust. The companies that win those deals are the ones that can show, not claim, that protecting customer systems is built into how they work every day.&lt;/p&gt;

&lt;p&gt;Security isn't the tax you pay to ship. It's part of why people let you ship to them at all.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/devops-security-best-practices-engineering-teams/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>devsecops</category>
      <category>hardening</category>
    </item>
    <item>
      <title>Humanizing Artificial Intelligence for SRE Teams: Reducing Alert Fatigue With Smarter AI Guidance</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 25 Jun 2026 14:27:13 +0000</pubDate>
      <link>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-for-sre-teams-reducing-alert-fatigue-with-smarter-ai-guidance-3411</link>
      <guid>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-for-sre-teams-reducing-alert-fatigue-with-smarter-ai-guidance-3411</guid>
      <description>&lt;p&gt;The pager goes off at 3:11 a.m. It's the fifth time tonight, and it's the same alert: &lt;code&gt;HighMemoryUsage&lt;/code&gt; on a node that's running a memory-mapped cache doing exactly what it was designed to do. You ack it half-asleep, knowing it'll fire again in twelve minutes. By the time the real incident shows up at 4:40 — a slow API degradation that's quietly eating your error budget — you're too fried to see it clearly. That's not a tooling failure. That's a design failure, and most of us have lived it.&lt;/p&gt;

&lt;p&gt;I run production OpenStack, Kubernetes, Terraform, and the observability stack that watches all of it. I've spent more nights than I'd like fighting my own alerts. So when "AI for SRE" started showing up in every vendor deck, my first reaction was a tired no. The last thing on-call needs is an autonomous bot deciding to restart my database at 3 a.m. based on a hunch.&lt;/p&gt;

&lt;p&gt;But there's a version of this that actually works, and it has nothing to do with autonomy. It's about using AI for the narrow set of things it's genuinely good at — clustering noise, summarizing storms, drafting hypotheses, surfacing the right runbook — while a human stays the final decision-maker. AI triages and proposes. You decide what pages a human and what gets fixed. That's what I mean by &lt;em&gt;humanizing&lt;/em&gt; AI: not making the machine more human, but using the machine to keep the on-call human rested, focused, and in control.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem is signal, not tooling
&lt;/h2&gt;

&lt;p&gt;Alert fatigue isn't caused by too few dashboards. It's caused by alerts that don't map to a human decision. Every page should answer one question: &lt;em&gt;does a person need to do something right now?&lt;/em&gt; If the answer is "no" or "not yet," it shouldn't page. We all know this. We violate it constantly because writing good alert rules is hard, and tuning them is a chore nobody prioritizes until they're drowning.&lt;/p&gt;

&lt;p&gt;The honest baseline, before any AI enters the picture, is rule hygiene. If your alerts are symptom-based, tied to user-facing SLOs, and have sane thresholds, you've already eliminated most of the noise. I've written about this at length in &lt;a href="https://devopsaitoolkit.com/blog/designing-alert-rules-that-dont-page-you-falsely/" rel="noopener noreferrer"&gt;designing alert rules that don't page you falsely&lt;/a&gt;, and the short version is: alert on what the user feels, not on what a single machine is doing. A node at 92% memory is not an incident. A checkout flow at 92% success rate is.&lt;/p&gt;

&lt;p&gt;Here's a symptom-based, multi-window burn-rate alert — the kind that respects your error budget instead of paging on every blip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slo-burn-rate&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CheckoutErrorBudgetFastBurn&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;(&lt;/span&gt;
            &lt;span class="s"&gt;job:slo_errors:ratio_rate5m{job="checkout"} &amp;gt; (14.4 * 0.001)&lt;/span&gt;
            &lt;span class="s"&gt;and&lt;/span&gt;
            &lt;span class="s"&gt;job:slo_errors:ratio_rate1h{job="checkout"} &amp;gt; (14.4 * 0.001)&lt;/span&gt;
          &lt;span class="s"&gt;)&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;page&lt;/span&gt;
          &lt;span class="na"&gt;slo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout-availability&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;burning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;30-day&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;budget&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;14.4x&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;too&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fast"&lt;/span&gt;
          &lt;span class="na"&gt;runbook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://runbooks.internal/checkout-availability"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;14.4&lt;/code&gt; multiplier on a 0.1% error budget is the classic fast-burn threshold: at that rate you'd exhaust a 30-day budget in roughly two days, so it deserves a human now. Pair it with a slower window for the grind-it-down failures. If multi-window burn rate is new to you, &lt;a href="https://devopsaitoolkit.com/blog/multi-window-burn-rate-alerts-for-slos-that-work/" rel="noopener noreferrer"&gt;this walkthrough&lt;/a&gt; is the most practical reference I keep handing to my team.&lt;/p&gt;
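
&lt;p&gt;If the numbers feel like magic, the arithmetic is short. This Python sketch reproduces the threshold used in the rule and the "roughly two days" figure, assuming a constant burn rate over a 30-day window. The function names are mine, not part of any library.&lt;/p&gt;

```python
SLO_WINDOW_DAYS = 30
ERROR_BUDGET = 0.001  # 0.1% of requests may fail, i.e. a 99.9% SLO
BURN_RATE = 14.4


def days_to_exhaust(burn_rate, window_days=SLO_WINDOW_DAYS):
    """Days until the whole budget is gone at a constant burn rate."""
    return window_days / burn_rate


def budget_spent(burn_rate, hours, window_days=SLO_WINDOW_DAYS):
    """Fraction of the window's budget consumed after `hours` at this burn rate."""
    return burn_rate * hours / (window_days * 24)


threshold = BURN_RATE * ERROR_BUDGET
print(round(threshold, 4))                   # error ratio the alert compares against
print(round(days_to_exhaust(BURN_RATE), 2))  # days until the budget is exhausted
print(round(budget_spent(BURN_RATE, 1), 3))  # share of the budget burned in one hour
```

&lt;p&gt;The three values come out to 0.0144, 2.08, and 0.02: the alert fires when the error ratio exceeds 1.44%, a pace that would empty the budget in about two days and spends 2% of it every hour.&lt;/p&gt;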

&lt;p&gt;&lt;em&gt;Pro Tip: Before you let AI touch a single alert, audit how many of your current pages are actionable. Pull a month of pages and bucket them: "I took an action," "I acked and ignored," "false positive." If more than a third land in the last two buckets, fix your rules first. AI applied to a noisy rule set just gives you faster, more confident noise.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI earns its place: the alert storm
&lt;/h2&gt;

&lt;p&gt;Rule hygiene gets you far, but it doesn't solve the 3 a.m. storm. When a top-of-rack switch flaps or a Kubernetes node goes &lt;code&gt;NotReady&lt;/code&gt;, you don't get one clean alert. You get forty: pod restarts, probe failures, downstream latency, queue backups, dependent service timeouts. Each one is technically "real." Collectively, they're a wall of text that you have to mentally parse while half-conscious.&lt;/p&gt;

&lt;p&gt;This is the single best place to put AI to work, because it's a summarization and clustering problem — exactly what large language models are good at. You feed the model the firing alerts, recent deploys, and relevant topology, and ask it to do what a senior SRE does instinctively: collapse the storm into a root cause plus its downstream effects.&lt;/p&gt;

&lt;p&gt;A good humanized output looks like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Alert storm summary — 41 alerts, last 6 min&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Most likely root cause (confidence: high):&lt;/strong&gt; Node &lt;code&gt;worker-prod-07&lt;/code&gt; went &lt;code&gt;NotReady&lt;/code&gt; at 03:08 UTC, 90s after the &lt;code&gt;ingress-nginx&lt;/code&gt; rollout (deploy #4821). 38 of 41 alerts are pods evicted from or routing through this node.&lt;br&gt;
&lt;strong&gt;Downstream effects (not separate incidents):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;checkout&lt;/code&gt; p99 latency +400ms — pods rescheduling&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;payment-worker&lt;/code&gt; queue depth rising — consumers restarting&lt;/li&gt;
&lt;li&gt;6x &lt;code&gt;KubePodCrashLooping&lt;/code&gt; — all on &lt;code&gt;worker-prod-07&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Suggested next check:&lt;/strong&gt; &lt;code&gt;kubectl describe node worker-prod-07&lt;/code&gt; and the kubelet logs around 03:07.&lt;br&gt;
&lt;strong&gt;Recommended runbook:&lt;/strong&gt; node-notready-triage.&lt;br&gt;
&lt;strong&gt;This summary is advisory. No action has been taken.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last line matters. The AI clustered 41 alerts into one root cause and three downstream effects, attached a confidence level, and pointed at the next diagnostic step. It did &lt;em&gt;not&lt;/em&gt; cordon the node, restart anything, or roll back the deploy. It handed me a hypothesis and a starting point, and I decide whether it's right. That cuts my time-to-understanding from "read forty alerts" to "verify one claim" — which is exactly where you want AI to save you minutes, because those minutes directly improve your MTTA and MTTR. (If you want to be rigorous about which &lt;a href="https://devopsaitoolkit.com/blog/incident-metrics-that-matter-mtta-mttr-mtbf/" rel="noopener noreferrer"&gt;incident metrics actually matter&lt;/a&gt;, MTTA is the one that alert fatigue quietly wrecks.)&lt;/p&gt;

&lt;p&gt;The mechanical clustering can — and should — live in Alertmanager. AI is the layer that &lt;em&gt;explains&lt;/em&gt; the cluster in plain language; Alertmanager is the layer that &lt;em&gt;suppresses&lt;/em&gt; the redundant pages so they never wake you in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Alertmanager do the suppression, let AI do the explaining
&lt;/h2&gt;

&lt;p&gt;Don't ask the LLM to be your routing engine. Deterministic grouping and inhibition rules are reliable, auditable, and free. Use them to fold the storm down before it ever reaches a human, then let AI summarize what's left.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;route&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oncall-pager&lt;/span&gt;
  &lt;span class="na"&gt;group_by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;alertname'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cluster'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;group_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
  &lt;span class="na"&gt;group_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
  &lt;span class="na"&gt;repeat_interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4h&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;severity = page&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;oncall-pager&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;severity = ticket&lt;/span&gt;
      &lt;span class="na"&gt;receiver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ticket-queue&lt;/span&gt;

&lt;span class="na"&gt;inhibit_rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# If a node is NotReady, don't page on the pods it took down with it.&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;alertname = KubeNodeNotReady&lt;/span&gt;
    &lt;span class="na"&gt;target_matchers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;alertname =~ "KubePodCrashLooping|KubePodNotReady"&lt;/span&gt;
    &lt;span class="na"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That inhibition rule alone kills the most common storm pattern: a node failure dragging twenty pod alerts behind it. The &lt;code&gt;equal: ['node']&lt;/code&gt; clause scopes the suppression so a genuinely unrelated crashloop on a &lt;em&gt;different&lt;/em&gt; node still pages. That scoping only works if both the node alert and the pod alerts carry a &lt;code&gt;node&lt;/code&gt; label: Alertmanager treats a missing label as an empty value, so alerts that all lack the label count as equal and get suppressed together. This is the deterministic floor. AI sits on top of it to explain the handful of alerts that remain, not to replace it. There's more nuance to getting this grouping right in &lt;a href="https://devopsaitoolkit.com/blog/cutting-alert-noise-designing-alerts-engineers-actually-trust/" rel="noopener noreferrer"&gt;cutting alert noise and designing alerts engineers actually trust&lt;/a&gt;, which pairs well with this approach.&lt;/p&gt;
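
&lt;p&gt;To see how the &lt;code&gt;equal&lt;/code&gt; clause behaves, here's a small Python model of that one inhibition rule. It is a sketch of the matching logic as I understand it from the Alertmanager docs, not Alertmanager's code: matchers are fully anchored, and a missing label is compared as an empty string. The alert dictionaries are made-up examples.&lt;/p&gt;

```python
import re

RULE = {
    "source": {"alertname": r"KubeNodeNotReady"},
    "target": {"alertname": r"KubePodCrashLooping|KubePodNotReady"},
    "equal": ["node"],
}


def matches(labels, matchers):
    """True if every matcher's pattern fully matches the alert's label value."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in matchers.items()
    )


def is_inhibited(target, firing, rule=RULE):
    """True if some firing source alert mutes `target` under `rule`."""
    if not matches(target, rule["target"]):
        return False
    for source in firing:
        if not matches(source, rule["source"]):
            continue
        # A missing label is compared as an empty value.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


node_down = {"alertname": "KubeNodeNotReady", "node": "worker-prod-07"}
crash_same = {"alertname": "KubePodCrashLooping", "node": "worker-prod-07"}
crash_other = {"alertname": "KubePodCrashLooping", "node": "worker-prod-02"}

print(is_inhibited(crash_same, [node_down]))   # same node: muted
print(is_inhibited(crash_other, [node_down]))  # different node: still pages
```

&lt;p&gt;One case worth testing in your own setup: if neither alert carries a &lt;code&gt;node&lt;/code&gt; label, both sides compare as empty and the pod alert is muted everywhere. Check that your pod alerts actually expose the label before relying on the rule.&lt;/p&gt;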

&lt;p&gt;&lt;em&gt;Pro Tip: Treat every AI summary as a hypothesis with a confidence score, never a verdict. Require the model to emit a "next check" — a concrete command or query a human can run to confirm or refute it in under a minute. A confident-sounding root cause you can't quickly verify is worse than no suggestion at all, because it anchors your tired brain to the wrong path.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prioritization: what actually deserves a human at 3 a.m.
&lt;/h2&gt;

&lt;p&gt;Clustering tells you &lt;em&gt;what&lt;/em&gt; is happening. Prioritization tells you &lt;em&gt;whether it can wait until morning&lt;/em&gt;. This is the judgment call that burns out on-call engineers, and it's another place AI can genuinely help — as long as it ranks and recommends rather than decides.&lt;/p&gt;

&lt;p&gt;The signal AI is good at synthesizing for prioritization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blast radius:&lt;/strong&gt; how many users, services, or SLOs are affected, pulled from the alert labels and topology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget velocity:&lt;/strong&gt; is this burning error budget fast, slow, or not at all? A fast burn pages; a slow grind opens a ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business context:&lt;/strong&gt; is checkout down at peak, or is an internal batch job late at 4 a.m.?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recent change correlation:&lt;/strong&gt; did a deploy, config push, or Terraform apply land in the last 30 minutes?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feed those signals to the model and ask it to propose a severity and a routing decision &lt;em&gt;with reasoning&lt;/em&gt;. The output should read like a recommendation you can override in one click:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Recommendation: open ticket, do not page (confidence: medium).&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;BatchExportLatencyHigh&lt;/code&gt; is firing, but it's an internal nightly export with no SLO and no user impact. Error budget for the user-facing API is untouched. No deploy correlated. Suggest ticket for the morning; re-evaluate if it's still firing after 2 retries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You, the human, glance at that and either agree or say "actually that export feeds the 6 a.m. finance report, page me." The AI never had the business context that the export feeds finance — and that's the point. It proposes; you supply the judgment it can't have. Over time you encode that judgment back into the rules so the AI gets it right next time. For a broader survey of what's actually useful here, I keep a running list of the &lt;a href="https://devopsaitoolkit.com/blog/best-ai-tools-for-sre-teams/" rel="noopener noreferrer"&gt;best AI tools for SRE teams&lt;/a&gt; and what each is realistically good for.&lt;/p&gt;
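
&lt;p&gt;The shape of that recommendation can be pinned down before any model is involved. Below is a minimal Python sketch of a deterministic baseline that turns the four signals into a proposed action plus its reasons. Every field name and threshold is an assumption for illustration. The output is a proposal that stays overridable, which is the part that matters.&lt;/p&gt;

```python
def recommend(signal):
    """Propose 'page' or 'ticket' plus the reasons; a human makes the final call."""
    reasons = []
    if signal["slo_affected"] and signal["burn_rate"] >= 14.4:
        reasons.append("fast error-budget burn on a user-facing SLO")
    if signal["users_affected"] > 0 and signal["peak_hours"]:
        reasons.append("user impact during peak hours")
    action = "page" if reasons else "ticket"
    if signal["recent_change"]:
        reasons.append("a change landed in the last 30 minutes; check it first")
    if not reasons:
        reasons.append("no SLO impact, user impact, or recent change detected")
    return {"action": action, "reasons": reasons, "overridable": True}


# Hypothetical signal for the nightly export alert described above.
batch_export = {
    "slo_affected": False,
    "burn_rate": 0.0,
    "users_affected": 0,
    "peak_hours": False,
    "recent_change": False,
}
result = recommend(batch_export)
print(result["action"], result["reasons"])
```

&lt;p&gt;For the export alert this proposes a ticket, which is the same wrong-but-reasonable answer from the example. Once someone overrides it, the fix is a new field or rule for "feeds a morning report", not a smarter guess.&lt;/p&gt;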

&lt;h2&gt;
  
  
  Surfacing the runbook, not writing the fix
&lt;/h2&gt;

&lt;p&gt;When you do get paged for something real, the next time-sink is finding the right runbook and remembering the current state of the system. AI is good at retrieval and assembly: given the firing alert and its labels, it can surface the most relevant runbook, pull the last three related incidents, and assemble a quick situation brief.&lt;/p&gt;

&lt;p&gt;What it should produce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The runbook link that matches this alert's &lt;code&gt;runbook&lt;/code&gt; annotation, plus any siblings.&lt;/li&gt;
&lt;li&gt;A one-paragraph history: "This alert fired twice in the last 30 days; both were resolved by draining the node, mean resolution 14 minutes."&lt;/li&gt;
&lt;li&gt;The current relevant state: recent deploys, open changes, who's already in the channel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it should &lt;em&gt;not&lt;/em&gt; produce: the executed remediation. The model can draft the &lt;code&gt;kubectl drain&lt;/code&gt; command and explain why. A human reads it, sanity-checks the node, and runs it. The gap between "here's the command" and "I ran the command for you" is the entire safety margin of this approach. Drain the wrong node and you've turned one incident into two.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Wire your runbook links directly into alert annotations, like the &lt;code&gt;runbook:&lt;/code&gt; field in the burn-rate example above. AI retrieval is dramatically more reliable when every alert already carries a canonical pointer — you're asking the model to fetch and summarize a known document, not to guess which runbook applies. Garbage retrieval in, confident-but-wrong brief out.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing the loop without a bot closing it for you
&lt;/h2&gt;

&lt;p&gt;The part everyone skips is the follow-through. The storm ends, the incident resolves, and the action items — "tune that node alert," "add an inhibition rule," "fix the probe timeout" — evaporate into a Slack thread nobody reads again. Then the same storm wakes you next month.&lt;/p&gt;

&lt;p&gt;AI helps here too, in its lane: it can draft the postmortem timeline from the alert and chat history, extract the action items, and propose the specific alert-rule changes that would have prevented the storm. But a human owns each item and a human merges the rule change. The discipline of &lt;a href="https://devopsaitoolkit.com/blog/closing-the-loop-making-incident-action-items-actually-get-done/" rel="noopener noreferrer"&gt;actually closing the loop on incident action items&lt;/a&gt; is what turns last night's pain into next month's quiet. AI can draft the fix; it can't care that it ships. That's still on us.&lt;/p&gt;

&lt;p&gt;This is the feedback loop that makes the whole system humane over time. Every overridden AI recommendation, every false page you tune out, every storm you cluster becomes encoded knowledge — better rules, better inhibition, better routing. The AI gets a sharper picture, the human gets quieter nights, and the error budget stays honest. If you want a deeper tour of the patterns, the &lt;a href="https://devopsaitoolkit.com/categories/incident-response/" rel="noopener noreferrer"&gt;incident response category&lt;/a&gt; on the site collects the workflows I actually run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The line you don't cross
&lt;/h2&gt;

&lt;p&gt;Let me be blunt about the boundary, because the vendors won't be. AI in your on-call loop should do four things: cluster noise, summarize storms, draft hypotheses, and surface runbooks. It should rank and recommend with explicit confidence levels and a verifiable next check. It should never silently take an action against production, never auto-close an incident, never decide on its own that something doesn't deserve a human.&lt;/p&gt;

&lt;p&gt;The moment you hand the machine the keys to remediation, you've traded one kind of 3 a.m. terror (the noisy pager) for a worse one: a bot that did something you didn't sanction, leaving you to reverse-engineer its decision under pressure. Keep the human as the final approver. Use AI to make that human faster, calmer, and better-informed. That's the entire game.&lt;/p&gt;

&lt;p&gt;The goal was never to remove the engineer from the loop. It's to make sure that when the pager does go off at 3 a.m. — and it will — it's for something that genuinely deserves you, with a clear summary, a ranked hypothesis, and the right runbook already in hand. That's a humane on-call. That's humanized AI.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;James Joyner IV runs &lt;a href="https://devopsaitoolkit.com/work-with-me/" rel="noopener noreferrer"&gt;devopsaitoolkit.com&lt;/a&gt;, where he writes about keeping production systems and the humans who run them healthy. If you want to see this human-in-control approach in action, try the free AI Incident Response Assistant — it summarizes and triages, but you stay the one who decides.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>ai</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why AI Loves Ansible (And You Should Let It Help)</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 25 Jun 2026 04:36:48 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/why-ai-loves-ansible-and-you-should-let-it-help-3o2p</link>
      <guid>https://dev.to/devopsaitoolkit/why-ai-loves-ansible-and-you-should-let-it-help-3o2p</guid>
      <description>&lt;p&gt;If you compare how well Claude handles Ansible against how well it handles, say, raw bash or kubectl YAML, Ansible wins by a wide margin. The reason isn't subtle: Ansible's shape — declarative, idempotent, modules-with-arguments — happens to map almost perfectly to how LLMs reason. They're good at producing structured output that fills in a known template, and that's what most Ansible tasks are.&lt;/p&gt;

&lt;p&gt;This means AI-assisted Ansible work is the highest-leverage automation pairing I know of. If you only adopt AI for one infrastructure tool, make it Ansible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Ansible AI-friendly
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modules have published contracts
&lt;/h3&gt;

&lt;p&gt;Every Ansible module has a documented argument spec: what's required, what's optional, what the defaults are. The model can fit your intent into the spec with high accuracy because the spec is finite and known.&lt;/p&gt;

&lt;p&gt;Compare this to shell: there are a thousand ways to "create a user with a specific UID, member of these groups, with this shell, and a home directory in this location." In bash, every distro is slightly different. In Ansible, you use &lt;code&gt;ansible.builtin.user&lt;/code&gt; with named arguments.&lt;/p&gt;

&lt;p&gt;The model gets this right &lt;em&gt;far more consistently&lt;/em&gt; than it does the equivalent shell.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotency is the default
&lt;/h3&gt;

&lt;p&gt;When a model generates a Python script, it has to think about "what if this is run twice." When it generates Ansible, most modules handle that for free. The model can write the task, ignore the re-run case, and produce something that works.&lt;/p&gt;

&lt;p&gt;This means the cognitive load on both sides — model and human — is lower. You're describing the target state, not the procedure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roles and structure are predictable
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;roles/foo/{defaults,vars,tasks,handlers,meta}/main.yml&lt;/code&gt;, plus &lt;code&gt;templates/&lt;/code&gt; and &lt;code&gt;files/&lt;/code&gt; directories: every Ansible role looks the same. The model can scaffold a new role in seconds because the layout is fixed.&lt;/p&gt;

&lt;p&gt;If you ask Claude to "create a new role for installing PostgreSQL 16 on Ubuntu 24.04 with default user &lt;code&gt;postgres&lt;/code&gt; and a tuned &lt;code&gt;postgresql.conf&lt;/code&gt;," you'll get a complete role structure with &lt;code&gt;defaults/main.yml&lt;/code&gt;, &lt;code&gt;tasks/main.yml&lt;/code&gt;, a Jinja template, and &lt;code&gt;handlers/main.yml&lt;/code&gt; — all consistent, all in the right places. The structure is constrained enough that the model rarely improvises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases where AI shines for Ansible
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generating new roles from scratch
&lt;/h3&gt;

&lt;p&gt;This is the killer app. You can describe a role in two sentences and get a 90%-done implementation. You then refine: add validation, adjust defaults, write a README.&lt;/p&gt;

&lt;p&gt;I now treat "draft a new role with Claude" as the default first step. Even if I rewrite half of it, the structure saves me 20 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Converting shell scripts to playbooks
&lt;/h3&gt;

&lt;p&gt;If you have a legacy bash script that provisions a server, pasting it into Claude with "convert this to an idempotent Ansible playbook using the appropriate modules" produces a usable result. The model knows when to use &lt;code&gt;ansible.builtin.file&lt;/code&gt;, &lt;code&gt;lineinfile&lt;/code&gt;, &lt;code&gt;template&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;You'll need to verify the idempotency manually (run twice, expect 0 changes on the second run), but the conversion is mostly mechanical.&lt;/p&gt;
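
&lt;p&gt;The "run twice, expect 0 changes" check can be scripted. This is a rough sketch that assumes you have captured the second run's &lt;code&gt;ansible-playbook&lt;/code&gt; output as text and that it ends with the usual PLAY RECAP lines; the function names are my own.&lt;/p&gt;

```python
import re

# Matches one host line of the PLAY RECAP, e.g.
# "web1 : ok=5 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(\S+)\s*:\s+(.*\bchanged=\d+.*)$")


def recap_counts(output):
    """Return {host: {counter: int}} parsed from ansible-playbook output."""
    hosts = {}
    for line in output.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match is None:
            continue
        host, counters = match.groups()
        hosts[host] = {
            key: int(value)
            for key, value in re.findall(r"(\w+)=(\d+)", counters)
        }
    return hosts


def is_idempotent(second_run_output):
    """True when every host reports changed=0 and failed=0 on the re-run."""
    hosts = recap_counts(second_run_output)
    if not hosts:
        return False  # no recap found: treat as "not verified"
    return all(
        c.get("changed") == 0 and c.get("failed", 0) == 0
        for c in hosts.values()
    )


SAMPLE = """
PLAY RECAP *********************************************************************
web1                       : ok=5    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
"""
print(is_idempotent(SAMPLE))
```

&lt;p&gt;Molecule's idempotence step performs the same check for roles; the sketch is only useful for ad hoc playbooks that don't have a Molecule scenario yet.&lt;/p&gt;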

&lt;h3&gt;
  
  
  Refactoring playbooks to use FQCN
&lt;/h3&gt;

&lt;p&gt;Ansible 2.10+ wants fully-qualified collection names: &lt;code&gt;ansible.builtin.package&lt;/code&gt; instead of &lt;code&gt;package&lt;/code&gt;. Old playbooks have hundreds of short-form references. AI is a perfect fit for this kind of mass refactoring — it knows the mapping and won't get bored.&lt;/p&gt;

&lt;p&gt;Paste a 200-line playbook, ask for it back with FQCN throughout, and you're done in 30 seconds. Verify with &lt;code&gt;ansible-lint&lt;/code&gt;.&lt;/p&gt;
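
&lt;p&gt;To make the mapping concrete, here is a deliberately naive, line-based sketch of the same rewrite. The module list is a small hand-picked subset of &lt;code&gt;ansible.builtin&lt;/code&gt;, and the regex would misfire on an argument that shares a module's name (a &lt;code&gt;user:&lt;/code&gt; argument, for instance), so treat it as an illustration of the transformation rather than a replacement for the model or for &lt;code&gt;ansible-lint&lt;/code&gt;.&lt;/p&gt;

```python
import re

# Assumption: a small, hand-picked subset of ansible.builtin modules.
BUILTIN = {"package", "file", "template", "service", "lineinfile", "user", "copy"}

# A mapping key at the start of a line: optional "- ", a short name, a colon.
MODULE_KEY = re.compile(r"^(\s*(?:- )?)([a-z_]+):(.*)$")


def to_fqcn(playbook_text):
    """Rewrite short builtin module names to ansible.builtin.NAME, line by line."""
    out = []
    for line in playbook_text.splitlines():
        match = MODULE_KEY.match(line)
        if match and match.group(2) in BUILTIN:
            indent, name, rest = match.groups()
            line = indent + "ansible.builtin." + name + ":" + rest
        out.append(line)
    return "\n".join(out)


SAMPLE = """- name: Install nginx
  package:
    name: nginx
    state: present
- name: Start nginx
  service:
    name: nginx
    state: started"""
print(to_fqcn(SAMPLE))
```

&lt;p&gt;Lines that already use the fully-qualified form don't match the pattern, so running the function twice leaves the text unchanged.&lt;/p&gt;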

&lt;h3&gt;
  
  
  Writing Molecule tests
&lt;/h3&gt;

&lt;p&gt;Molecule scaffolding is repetitive — same &lt;code&gt;molecule.yml&lt;/code&gt;, same &lt;code&gt;converge.yml&lt;/code&gt;, same &lt;code&gt;verify.yml&lt;/code&gt; structure for most roles. AI is great at generating the boilerplate. You describe what you want to test; the model writes the assertion playbook.&lt;/p&gt;

&lt;h3&gt;
  
  
  Jinja template generation
&lt;/h3&gt;

&lt;p&gt;Jinja is just structured-enough that AI handles it well — generating templates for config files (nginx, postgres, sshd) from a description of the desired behavior. The model knows the configuration keys and the conditional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI struggles with Ansible
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Variable precedence
&lt;/h3&gt;

&lt;p&gt;Ansible's variable precedence rules (22 levels in the documented order) are not intuitive. The model will sometimes suggest putting a variable in &lt;code&gt;vars/main.yml&lt;/code&gt; when you really want it in &lt;code&gt;defaults/main.yml&lt;/code&gt; (the former overrides the latter, and also overrides inventory and play variables). The result: users of your role can't override the variable they expected to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When the model puts something in &lt;code&gt;vars/&lt;/code&gt;, ask "should this be overridable by the role user?" If yes, move to &lt;code&gt;defaults/&lt;/code&gt;.&lt;/p&gt;
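
&lt;p&gt;A toy model shows why the placement matters. It assumes only four of Ansible's precedence levels (role defaults, inventory variables, role vars, extra vars) in their documented relative order, and the variable names are made up for illustration.&lt;/p&gt;

```python
# Simplified model: four precedence levels only, lowest first. The real
# ordering has many more levels; these four keep their documented order.
LEVELS = ["role_defaults", "inventory", "role_vars", "extra_vars"]


def resolve(sources):
    """Merge variable sources so that later (higher) levels win."""
    merged = {}
    for level in LEVELS:
        merged.update(sources.get(level, {}))
    return merged


sources = {
    # The role author put pg_max_conn in defaults/ and pg_port in vars/.
    "role_defaults": {"pg_max_conn": 100},
    "role_vars": {"pg_port": 5432},
    # The role user tries to override both from inventory.
    "inventory": {"pg_port": 6432, "pg_max_conn": 300},
}
result = resolve(sources)
print(result)
```

&lt;p&gt;The inventory override of &lt;code&gt;pg_max_conn&lt;/code&gt; takes effect, while the override of &lt;code&gt;pg_port&lt;/code&gt; is silently lost because role vars outrank inventory.&lt;/p&gt;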

&lt;h3&gt;
  
  
  Custom facts and &lt;code&gt;set_fact&lt;/code&gt; lifetime
&lt;/h3&gt;

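&lt;p&gt;The model sometimes uses &lt;code&gt;set_fact&lt;/code&gt; for values that need to persist beyond the current run, but doesn't add &lt;code&gt;cacheable: true&lt;/code&gt; (or assumes fact caching is enabled when it isn't). A &lt;code&gt;set_fact&lt;/code&gt; value is scoped to the host it was set on and lasts for the rest of the playbook run, including later plays. On the next run, or from a play that targets a different host, it is &lt;code&gt;undefined&lt;/code&gt; unless you cache it or read it through &lt;code&gt;hostvars&lt;/code&gt;.&lt;/p&gt;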

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When you use &lt;code&gt;set_fact&lt;/code&gt; for a value you need later, verify the lifetime is what you expect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vault integration
&lt;/h3&gt;

&lt;p&gt;The model will sometimes generate playbooks that reference &lt;code&gt;vault_db_password&lt;/code&gt; as a variable but don't include the &lt;code&gt;lookup('community.hashi_vault.hashi_vault', ...)&lt;/code&gt; call or the Ansible Vault encrypted file. You have to wire up the secret source separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; For any sensitive variable in a generated playbook, verify there's an actual source for it (Vault encrypted file, external manager lookup, environment variable).&lt;/p&gt;
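
&lt;p&gt;One way to automate part of that check, sketched under two assumptions: secrets follow the &lt;code&gt;vault_&lt;/code&gt; prefix convention from the example above, and you can supply the list of names that actually have a source (however you collect them). The function name is hypothetical.&lt;/p&gt;

```python
import re

# Assumption: secrets are referenced through Jinja expressions such as
# {{ vault_db_password }} and share the vault_ prefix.
REFERENCE = re.compile(r"\{\{\s*(vault_\w+)")


def unsourced_secrets(playbook_text, defined_names):
    """Return vault_* names the playbook references but nothing defines."""
    referenced = set(REFERENCE.findall(playbook_text))
    return sorted(referenced - set(defined_names))


PLAYBOOK = (
    'login_password: "{{ vault_db_password }}"\n'
    'api_key: "{{ vault_api_key }}"'
)
print(unsourced_secrets(PLAYBOOK, defined_names=["vault_api_key"]))
```

&lt;p&gt;An empty result only means every referenced name is declared somewhere; it says nothing about whether the value behind it is correct.&lt;/p&gt;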

&lt;h3&gt;
  
  
  Distro-specific paths
&lt;/h3&gt;

&lt;p&gt;The model defaults to Debian/Ubuntu conventions. If you run on RHEL, you'll sometimes get &lt;code&gt;apt&lt;/code&gt; modules in tasks that should be using the &lt;code&gt;package&lt;/code&gt; module (or distro conditionals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When generating playbooks for non-Debian systems, audit for &lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;apt_repository&lt;/code&gt;, &lt;code&gt;dpkg_selections&lt;/code&gt;, and ask for the abstraction (&lt;code&gt;package&lt;/code&gt;) or the distro split.&lt;/p&gt;
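
&lt;p&gt;The audit itself is easy to script. A small sketch, assuming the playbook is available as text and that the three modules named above are the ones you care about; it matches both the short and the &lt;code&gt;ansible.builtin.&lt;/code&gt; forms.&lt;/p&gt;

```python
import re

# Debian-specific module keys, with or without the ansible.builtin. prefix.
PATTERN = re.compile(
    r"^\s*(?:- )?(?:ansible\.builtin\.)?(apt|apt_repository|dpkg_selections):"
)


def debian_only_tasks(playbook_text):
    """Return (line_number, module) for each Debian-specific module key."""
    hits = []
    for number, line in enumerate(playbook_text.splitlines(), start=1):
        match = PATTERN.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


SAMPLE = """- name: Install nginx
  ansible.builtin.apt:
    name: nginx
- name: Add repo
  apt_repository:
    repo: ppa:example/ppa
- name: Portable install
  ansible.builtin.package:
    name: nginx"""
print(debian_only_tasks(SAMPLE))
```

&lt;p&gt;Each hit is a task to hand back to the model with a request for &lt;code&gt;package&lt;/code&gt; or an explicit distro conditional.&lt;/p&gt;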

&lt;h2&gt;
  
  
  A workflow that's been working for me
&lt;/h2&gt;

&lt;p&gt;For a new role, my process now looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Describe the role&lt;/strong&gt; to Claude in 2-3 sentences (purpose, target distros, key behaviors).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate the scaffolding&lt;/strong&gt;: &lt;code&gt;defaults/main.yml&lt;/code&gt;, &lt;code&gt;tasks/main.yml&lt;/code&gt;, a template if needed, &lt;code&gt;meta/main.yml&lt;/code&gt; with platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every task.&lt;/strong&gt; Look for the failure modes above (precedence, lifetime, Vault, distros).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Molecule tests.&lt;/strong&gt; Have Claude scaffold &lt;code&gt;molecule/default/&lt;/code&gt;, then write the assertions yourself or ask for them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;ansible-lint&lt;/code&gt; and Molecule.&lt;/strong&gt; Fix what they catch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotence check.&lt;/strong&gt; Run the role twice; second run should report 0 changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refine the README.&lt;/strong&gt; This is the one place I write from scratch — explaining the role to future-me.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This takes maybe 30 minutes for a moderately complex role. Without AI assistance, the same role would take me a couple of hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on safety
&lt;/h2&gt;

&lt;p&gt;Ansible typically runs with root privileges (via &lt;code&gt;become&lt;/code&gt;) on production servers. Whatever the model generates, &lt;em&gt;you&lt;/em&gt; are responsible for what it does. Two patterns I follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
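&lt;strong&gt;Run with &lt;code&gt;--check --diff&lt;/code&gt; before any real run.&lt;/strong&gt; Dry-run the playbook in check mode; verify the diff matches what you expect. Tasks that use &lt;code&gt;command&lt;/code&gt; or &lt;code&gt;shell&lt;/code&gt; are skipped in check mode by default, so a clean dry run does not cover them.&lt;/li&gt;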
&lt;li&gt;
&lt;strong&gt;Test on a sandbox host first.&lt;/strong&gt; Especially for new roles. Don't trust the model with production until the role has run cleanly on a throwaway VM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the same disciplines that apply to any infrastructure change. AI doesn't change the discipline; it just makes you faster at the parts before the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think Ansible is the right entry point
&lt;/h2&gt;

&lt;p&gt;If you're new to using AI for infrastructure work and want to pick one tool to start with, Ansible is the safest, highest-leverage choice. The structure makes the AI accurate. The idempotency makes re-runs safe, so fixing a mistake is usually a matter of correcting the task and running again. The module ecosystem covers most common cases.&lt;/p&gt;

&lt;p&gt;By the time you've used AI to write a dozen Ansible playbooks, you'll have developed the intuition for what AI handles well and what needs human attention. That intuition transfers to harder tools — Terraform, Kubernetes, custom shell — where the cost of AI mistakes is higher.&lt;/p&gt;

&lt;p&gt;For our full set of AI-driven Ansible workflows, see the &lt;a href="https://dev.to/categories/iac/"&gt;IaC category&lt;/a&gt; — including &lt;a href="https://dev.to/prompts/ansible-vault-secrets-management/"&gt;ansible-vault-secrets-management&lt;/a&gt; and &lt;a href="https://dev.to/prompts/ansible-molecule-testing/"&gt;ansible-molecule-testing&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/why-ai-loves-ansible/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ansible</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>Humanizing Artificial Intelligence for Platform Engineering: Helping Internal Developer Platforms Become Easier to Use</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Tue, 23 Jun 2026 15:10:35 +0000</pubDate>
      <link>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-for-platform-engineering-helping-internal-developer-platforms-6bc</link>
      <guid>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-for-platform-engineering-helping-internal-developer-platforms-6bc</guid>
      <description>&lt;h2&gt;
  
  
  The Developer Who Just Wanted to Ship a Service
&lt;/h2&gt;

&lt;p&gt;A backend developer on my team pinged me last week. She'd built a small Go service, it passed tests locally, and she wanted to get it running in our staging cluster. Simple ask. Then she opened the platform wiki.&lt;/p&gt;

&lt;p&gt;Forty pages. A section on naming conventions. A page on which Terraform module to use for a Postgres instance, with a note at the bottom saying that page was "mostly current." A Helm values reference. A link to a deprecated runbook that someone forgot to delete. By the time she found the right golden-path template, she'd burned half a day and pinged three people in Slack. She didn't need a tutorial on our platform. She needed someone to walk her to the right door.&lt;/p&gt;

&lt;p&gt;That gap is the actual problem with most internal developer platforms. The paved road exists. The templates exist. The policy is sound. But the cognitive load of &lt;em&gt;finding&lt;/em&gt; the paved road is so high that developers route around it, copy a YAML file from another team, and ship something that violates four of our standards. The platform isn't failing on capability. It's failing on usability.&lt;/p&gt;

&lt;p&gt;This is where I think AI earns its keep in platform engineering, and it's a narrower claim than the hype suggests. "Humanizing AI" here doesn't mean a chatbot that replaces your platform team. It means using AI as the friendly front door to your IDP: turning a plain-English request into the right golden-path template, scaffolding the boilerplate against an &lt;em&gt;approved&lt;/em&gt; template, and explaining the cryptic platform error so the developer fixes it correctly instead of hacking around it. The guardrails, the templates, the approvals, and the humans who own them all stay exactly where they are. AI lowers the friction of self-service. It does not bypass the paved road.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Service Is Only Self-Service If People Can Find It
&lt;/h2&gt;

&lt;p&gt;The promise of an IDP is that a developer can provision what they need without filing a ticket and waiting two days. The reality, on most platforms I've seen, is a self-service portal that assumes the developer already knows the vocabulary of the platform. Which module? Which environment tier? Which approval gate applies to a service that touches PII?&lt;/p&gt;

&lt;p&gt;A developer asking "I need a new Go service with Postgres" is asking a perfectly reasonable question in human terms. The platform answers in machine terms: pick a template ID, fill in fourteen variables, know in advance which of them are required. The translation layer between those two is exactly what an AI assistant is good at — and exactly where most teams have been doing the translation manually, in Slack, over and over, forever.&lt;/p&gt;

&lt;p&gt;The trick is grounding. An ungrounded AI will happily tell your developer to drop in some Terraform it saw on a blog in 2021. That's worse than the wiki, because it's confidently wrong. A grounded assistant — one wired to &lt;em&gt;your&lt;/em&gt; template catalog, &lt;em&gt;your&lt;/em&gt; golden paths, &lt;em&gt;your&lt;/em&gt; current docs via retrieval — recommends the paved road because the paved road is the only thing in its context.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Don't fine-tune a model on your platform and call it done. Ground it with retrieval over your live template repo, your Backstage software catalog, and your current runbooks. Templates change weekly; a fine-tuned model freezes a snapshot. Retrieval means the assistant recommends what's actually approved today, not what was approved the day someone trained it.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  From a Sentence to a Scaffold
&lt;/h2&gt;

&lt;p&gt;Here's the concrete artifact. A developer types into the platform assistant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I need a new Go service called &lt;code&gt;billing-events&lt;/code&gt; with a Postgres database, staging and prod, and it needs to publish to our Kafka topic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A grounded assistant doesn't generate freehand infrastructure. It maps that request onto the approved &lt;code&gt;go-service-pg&lt;/code&gt; golden-path template in our catalog and produces the scaffold the platform team already blessed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;billing-events/
  catalog-info.yaml        # Backstage entity, owner team pre-filled
  cmd/main.go              # service skeleton, our logging + health endpoints
  Dockerfile               # from the approved base image, pinned digest
  helm/
    values.yaml            # resource limits from the "small" tier preset
  terraform/
    main.tf                # calls module "billing_events_db" { source = "registry.internal/our-org/rds/aws" version = "4.2.1" }
    kafka.tf               # topic ACL via the approved kafka module
  .github/workflows/
    deploy.yaml            # the standard pipeline, not a bespoke one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Terraform it emits isn't invented. It calls our internal module registry at a pinned version, with the variables the template requires:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"billing_events_db"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"registry.internal/our-org/rds/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"4.2.1"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db.t4g.medium"&lt;/span&gt;   &lt;span class="c1"&gt;# from the "small" tier&lt;/span&gt;
  &lt;span class="nx"&gt;multi_az&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;              &lt;span class="c1"&gt;# prod default, policy-enforced&lt;/span&gt;
  &lt;span class="nx"&gt;pii&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;             &lt;span class="c1"&gt;# tagged so policy bot knows the rules&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what the AI did and didn't do. It filled in the boilerplate, picked sane defaults from our tier presets, and tagged the resource so our policy engine can reason about it. It did &lt;em&gt;not&lt;/em&gt; choose the module version, invent a new instance class, or skip &lt;code&gt;multi_az&lt;/code&gt;. Those are constraints baked into the template, and the assistant is generating &lt;em&gt;into&lt;/em&gt; the template, not around it. If the developer asks for something the template can't express — a database class we don't support, a region we don't run in — the right answer from the assistant is "that's not on the paved road, here's who to talk to," not a creative workaround.&lt;/p&gt;
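
&lt;p&gt;The "generate into the template, not around it" rule can be sketched as a lookup that either returns an approved template or declines. Everything here is hypothetical: the catalog structure, the field names, and the owner value are illustrative stand-ins loosely based on the &lt;code&gt;go-service-pg&lt;/code&gt; example, not a real Backstage API.&lt;/p&gt;

```python
# Hypothetical catalog: names and limits are illustrative only.
CATALOG = {
    "go-service-pg": {
        "language": "go",
        "database": "postgres",
        "instance_classes": {"db.t4g.medium"},
        "owner": "platform-team",
    },
}


def pick_template(language, database, instance_class):
    """Return (template_name, None), or (None, message) when off the paved road."""
    for name, spec in CATALOG.items():
        if spec["language"] != language or spec["database"] != database:
            continue
        if instance_class not in spec["instance_classes"]:
            return None, (
                instance_class + " is not on the paved road for " + name
                + "; talk to " + spec["owner"]
            )
        return name, None
    return None, "no approved template for " + language + " + " + database


print(pick_template("go", "postgres", "db.t4g.medium"))
print(pick_template("go", "postgres", "db.r6g.16xlarge"))
```

&lt;p&gt;There is no branch that invents a new template or widens the allowed instance classes. The only two outcomes are an approved scaffold or a pointer to a human.&lt;/p&gt;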

&lt;p&gt;If you're new to wiring AI into this kind of container-to-cluster flow, the walkthrough in &lt;a href="https://devopsaitoolkit.com/blog/from-dockerfile-to-first-kubernetes-deployment-with-ai/" rel="noopener noreferrer"&gt;from Dockerfile to first Kubernetes deployment with AI&lt;/a&gt; is a good ground-level starting point for what "generate against a known-good shape" looks like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Golden Paths Need a Tour Guide, Not Just a Map
&lt;/h2&gt;

&lt;p&gt;A golden path is only golden if developers stay on it. The failure mode I see constantly: the platform team builds an excellent paved road, documents it beautifully, and then watches teams drift off it because at the moment of decision — 4 p.m., trying to ship before standup tomorrow — nobody reads the docs. They grab whatever's nearest.&lt;/p&gt;

&lt;p&gt;An AI assistant grounded in your golden paths changes the economics of that decision. Staying on the paved road becomes the &lt;em&gt;low-effort&lt;/em&gt; option, because the assistant does the lookup and the scaffolding for you. Drifting off becomes the high-effort option, because now you're writing Terraform by hand. That's the inversion you want. For years we've asked developers to do extra work to comply with the platform. Done right, AI makes compliance the path of least resistance.&lt;/p&gt;

&lt;p&gt;This matters most for the workflows that are genuinely hard to reason about. GitOps is a great example: the model of "Git is the source of truth, a controller reconciles the cluster to match" is powerful but unintuitive for someone shipping their first service. An assistant that can explain &lt;em&gt;why&lt;/em&gt; their change isn't live yet — "your manifest is merged, but Argo CD hasn't synced; here's the sync status and here's the app it belongs to" — is doing real developer support. If your platform runs this pattern, point people at the conceptual grounding in &lt;a href="https://devopsaitoolkit.com/blog/gitops-for-infrastructure-explained/" rel="noopener noreferrer"&gt;GitOps for infrastructure explained&lt;/a&gt; and the hands-on &lt;a href="https://devopsaitoolkit.com/blog/gitops-with-argo-cd-a-practical-starting-guide/" rel="noopener noreferrer"&gt;practical starting guide for Argo CD&lt;/a&gt;, and let the assistant answer the "but why isn't my thing deployed" questions against your actual setup.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Make the assistant cite its source. Every recommendation should link back to the template, runbook, or catalog entry it came from. This does two things: it lets the developer verify the paved road for themselves, and it gives your platform team a signal — when the assistant keeps citing a stale doc, that's your cue to fix the doc. The AI becomes a usage tracker for your documentation.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Explaining the Cryptic Error Instead of Inviting a Hack
&lt;/h2&gt;

&lt;p&gt;The highest-leverage place for AI in an IDP is the moment a developer hits a platform error they don't understand. This is where people hack. The pipeline fails with something opaque, the developer doesn't know what it means, and the fastest escape is to disable the check, add an &lt;code&gt;|| true&lt;/code&gt;, or copy a workaround from a Stack Overflow answer that has nothing to do with your platform.&lt;/p&gt;

&lt;p&gt;A good assistant intercepts that moment. Here's the kind of output I want to see when a deploy fails our policy gate:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Your deploy to &lt;code&gt;prod&lt;/code&gt; was blocked by the &lt;code&gt;require-resource-limits&lt;/code&gt; policy.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your &lt;code&gt;helm/values.yaml&lt;/code&gt; sets a CPU request but no CPU limit. Our prod admission policy requires both — this prevents one service from starving the node. You don't need to disable the check.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;go-service-pg&lt;/code&gt; golden path ships a &lt;code&gt;values-prod.yaml&lt;/code&gt; with the "small" tier preset that already includes compliant limits. You're missing it because this service was scaffolded before that preset existed. Copy the &lt;code&gt;resources:&lt;/code&gt; block from the template here: [link to template]. After that, re-run the deploy — no override needed.&lt;/p&gt;

&lt;p&gt;If you genuinely need higher limits than the preset allows, that's a capacity conversation with the platform team, not a policy bypass. Open a request here: [link].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That output respects the guardrail. It explains &lt;em&gt;why&lt;/em&gt; the policy exists, points to the golden-path fix, and explicitly closes the door on the hack — while leaving a legitimate human-in-the-loop escalation path for the real exception. Compare that to the developer's default move, which is to find the flag that makes the red text go away. The AI didn't weaken the policy. It made the policy &lt;em&gt;understandable&lt;/em&gt;, which is the only thing that makes a policy survive contact with a deadline.&lt;/p&gt;
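
&lt;p&gt;The deterministic half of that explanation, the part that decides what is actually missing, doesn't need a model at all. A rough sketch, assuming Helm values that follow the common &lt;code&gt;resources.requests&lt;/code&gt; / &lt;code&gt;resources.limits&lt;/code&gt; layout; the policy name comes from the example above and the message wording is my own.&lt;/p&gt;

```python
# Assumption: values mirror the usual Kubernetes resources layout.
POLICY = "require-resource-limits"


def explain_missing_limits(values):
    """Return a plain-English explanation, or None when the values comply."""
    resources = values.get("resources") or {}
    gaps = [
        "resources." + section + ".cpu"
        for section in ("requests", "limits")
        if "cpu" not in (resources.get(section) or {})
    ]
    if not gaps:
        return None
    return (
        "Blocked by " + POLICY + ": " + " and ".join(gaps) + " not set. "
        "Prod requires both a CPU request and a CPU limit so one service "
        "cannot starve the node. Copy the resources: block from the "
        "template's values-prod.yaml; do not disable the check."
    )


blocked = {"resources": {"requests": {"cpu": "250m"}}}
print(explain_missing_limits(blocked))
```

&lt;p&gt;The assistant's job is then to wrap this finding in context (why the preset is missing, where the template lives), while the pass/fail decision stays with the policy engine.&lt;/p&gt;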

&lt;p&gt;There's a deeper point here about toil. Every one of these "what does this error mean" questions used to land in the platform team's Slack channel, and answering them is pure, repetitive toil — the same Postgres-module question, the same policy-gate question, fifteen times a month. Routing those to a grounded assistant frees your senior people to work on the platform instead of staffing a help desk for it. I've written more about that specific dynamic in &lt;a href="https://devopsaitoolkit.com/blog/identifying-and-eliminating-toil-with-ai/" rel="noopener noreferrer"&gt;identifying and eliminating toil with AI&lt;/a&gt;, and platform support is one of the cleanest examples of toil I know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Workflows: Narrate, Don't Drive
&lt;/h2&gt;

&lt;p&gt;I want to be precise about where AI sits in the deployment path, because this is where teams get nervous and they're right to. The assistant should narrate and assist the deployment workflow. It should not be the thing that pushes to prod.&lt;/p&gt;

&lt;p&gt;The pattern that works: the developer commits to Git, the pipeline runs, the controller reconciles — your normal, audited, human-approved flow. The AI's job is to make that flow legible. "Here's where your change is. It's merged, it passed CI, it's waiting on the prod approval gate, which is owned by your team lead." When something goes wrong, it explains the failure in the terms above. When the developer asks "how do I roll back," it points them at the actual rollback procedure in your runbook — the real one, retrieved live, not a generic Kubernetes answer.&lt;/p&gt;
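
&lt;p&gt;"Narrate, don't drive" can be as simple as a function that reads pipeline state and returns a sentence. This sketch uses hypothetical stage names and assumes the status and ownership data have already been fetched read-only from your CI system and GitOps controller.&lt;/p&gt;

```python
# Hypothetical stage names, in pipeline order.
STAGES = ("merged", "ci_passed", "prod_approved", "synced")


def narrate(status, owners):
    """Describe where a change is. Reads state only; triggers nothing."""
    done = []
    for stage in STAGES:
        if status.get(stage):
            done.append(stage)
            continue
        owner = owners.get(stage, "the platform team")
        finished = ", ".join(done) if done else "nothing yet"
        return ("Done so far: " + finished + ". Waiting on " + stage
                + ", owned by " + owner + ".")
    return "All stages complete: " + ", ".join(done) + "."


print(narrate({"merged": True, "ci_passed": True},
              {"prod_approved": "your team lead"}))
```

&lt;p&gt;Nothing in the function can approve, promote, or sync. If the developer wants the gate to move, the output tells them which human to ask.&lt;/p&gt;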

&lt;p&gt;What it should never do is hold the credentials to apply infrastructure or promote a release on its own. The approval gates are the guardrail. A human owns the merge. A human owns the prod promotion. The AI is the assistant standing next to the control panel explaining what each button does — not the operator with its hands on the panel. If you build platform automation that AI helps developers operate, the &lt;a href="https://devopsaitoolkit.com/blog/kubernetes-operator-pattern-a-devops-engineers-guide/" rel="noopener noreferrer"&gt;Kubernetes operator pattern guide&lt;/a&gt; is worth reading for how to keep the reconciliation logic — the part that actually changes cluster state — in deterministic, audited controllers rather than in a probabilistic model.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Treat the AI assistant's outputs as suggestions that flow into your existing review and policy gates, never as a side channel that skips them. The scaffold it generates should land as a pull request a human reviews. The Terraform it writes should still hit &lt;code&gt;plan&lt;/code&gt;, policy-as-code, and approval. If your assistant can make a change that your normal pipeline couldn't, you've built a backdoor, not a front door.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Human Platform Team Is the Point
&lt;/h2&gt;

&lt;p&gt;It's tempting to read all of this as "AI replaces the platform team." It's the opposite. Every capability I've described depends on the platform team's work being good: the templates have to be right, the golden paths have to be real, the docs have to be current, the policies have to be sound. The AI is a multiplier on that work, not a substitute for it. Point it at a mediocre platform and you get a fast, confident way to spread mediocrity.&lt;/p&gt;

&lt;p&gt;The work shifts, though. Less time spent answering the same onboarding question for the hundredth time. More time spent making sure the templates the assistant recommends are excellent, because now they reach every developer instantly. Your golden paths become higher-stakes in the best way — they're no longer docs that maybe get read; they're the thing the front door walks everyone toward. That's a better job. It's the platform-engineering job, freed from the help-desk tax.&lt;/p&gt;

&lt;p&gt;"Humanizing AI" in this context means keeping humans in control of the road while using AI to make the road easy to walk. The developer who just wants to ship a Go service gets walked to the right template in thirty seconds instead of reading forty pages of wiki. The platform team keeps its guardrails, gains a tireless first-line guide, and gets back the hours it used to spend repeating itself. Nobody bypasses the paved road. The paved road just finally became the obvious one to take.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;James Joyner IV runs &lt;a href="https://devopsaitoolkit.com/work-with-me/" rel="noopener noreferrer"&gt;devopsaitoolkit.com&lt;/a&gt;, where he writes about running AI alongside real production platforms. If you want to put grounded, human-in-control prompts to work on your own platform, start with the &lt;a href="https://devopsaitoolkit.com/prompts/" rel="noopener noreferrer"&gt;prompt library&lt;/a&gt; or the &lt;a href="https://devopsaitoolkit.com/prompt-packs/writing-humanizer/" rel="noopener noreferrer"&gt;Writing Humanizer pack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>devops</category>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Vibe-Coded Infrastructure: How to Ship Fast Without Torching Production</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Mon, 22 Jun 2026 19:05:48 +0000</pubDate>
      <link>https://dev.to/jjoyneriv/vibe-coded-infrastructure-how-to-ship-fast-without-torching-production-3l83</link>
      <guid>https://dev.to/jjoyneriv/vibe-coded-infrastructure-how-to-ship-fast-without-torching-production-3l83</guid>
      <description>&lt;p&gt;You described what you wanted in plain English, the model wrote the Terraform, you ran &lt;code&gt;apply&lt;/code&gt;, and it worked. No docs, no Stack Overflow, no fighting HCL syntax for an hour. That's &lt;strong&gt;vibe coding&lt;/strong&gt; — building by describing intent and riding the model's output — and for infrastructure it is genuinely, addictively fast.&lt;/p&gt;

&lt;p&gt;It's also how you accidentally &lt;code&gt;terraform destroy&lt;/code&gt; a production VPC at 2 p.m. on a Tuesday.&lt;/p&gt;

&lt;p&gt;I run production OpenStack, Kubernetes, and Terraform for a living, and I vibe-code a &lt;em&gt;lot&lt;/em&gt; of it now. Here's the honest version of how to keep the speed without the smoking crater — the rules I actually follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe coding is great at drafts and terrible at consequences
&lt;/h2&gt;

&lt;p&gt;The thing a model is brilliant at is turning "I need a rate-limited NGINX reverse proxy in front of this service" into 40 lines of config in two seconds. The thing it has no idea about is that &lt;em&gt;this&lt;/em&gt; service shares an upstream with the billing API, that your last outage came from exactly this kind of change, and that the "harmless" reload will drop in-flight connections during your peak hour.&lt;/p&gt;

&lt;p&gt;The model writes code. It does not carry the consequences. You do. So the entire game of vibe-coding infrastructure safely is &lt;strong&gt;keeping the human on the hook for the blast radius&lt;/strong&gt; while letting the machine do the typing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 1: Vibe the draft, never the apply
&lt;/h2&gt;

&lt;p&gt;The single most important habit: the AI is allowed to &lt;em&gt;propose&lt;/em&gt; changes. It is never allowed to &lt;em&gt;make&lt;/em&gt; them. There's a world of difference between "here's the &lt;code&gt;kubectl patch&lt;/code&gt; you'd run" and a bot that runs it for you.&lt;/p&gt;

&lt;p&gt;Concretely, that means I demand a plan I can read before anything touches a real system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; tfplan &amp;gt; tfplan.json  &lt;span class="c"&gt;# paste tfplan.json into the model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;"Read this plan. Tell me in plain English what changes, flag anything that &lt;strong&gt;destroys or replaces&lt;/strong&gt; a resource, and rank the three riskiest changes. Don't tell me it's fine — tell me what could go wrong."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That &lt;code&gt;# forces replacement&lt;/code&gt; annotation buried on line 200 is exactly the thing vibe-coding glosses over and a careful read catches. I wrote up the full version of this (&lt;a href="https://devopsaitoolkit.com/blog/dry-running-destructive-scripts-with-ai-before-prod/" rel="noopener noreferrer"&gt;parsing a Terraform plan for AI-assisted review&lt;/a&gt;), but the one-liner is: &lt;strong&gt;the model reads the plan; you approve the apply.&lt;/strong&gt;&lt;/p&gt;
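
&lt;p&gt;Before the plan even reaches the model, a few lines of Python can pull out the destroy and replace actions so they can't hide. This is a minimal sketch, assuming the plan was exported with &lt;code&gt;terraform show -json&lt;/code&gt;; the function name &lt;code&gt;risky_changes&lt;/code&gt; and the sample resource addresses are made up for illustration.&lt;/p&gt;

```python
import json


def risky_changes(plan):
    """Return (address, kind) pairs for resources the plan deletes or replaces."""
    findings = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a list holding both "delete" and "create".
        if "delete" not in actions:
            continue
        kind = "replace" if "create" in actions else "destroy"
        findings.append((rc.get("address", "unknown"), kind))
    return findings


# Hypothetical hand-written excerpt standing in for real `terraform show -json` output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_vpc.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}},
    {"address": "aws_instance.old", "change": {"actions": ["delete"]}},
    {"address": "aws_instance.new", "change": {"actions": ["create"]}}
  ]
}
""")

for address, kind in risky_changes(sample_plan):
    print(f"{kind.upper():8} {address}")
```

&lt;p&gt;In real use you would load the exported file with &lt;code&gt;json.load&lt;/code&gt; instead of the inline sample. The list it prints is a starting point for the careful read, not a substitute for it.&lt;/p&gt;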

&lt;h2&gt;
  
  
  Rule 2: Make destructive commands earn a second look
&lt;/h2&gt;

&lt;p&gt;Vibe-coded shell scripts are where people get hurt fastest, because bash will cheerfully run &lt;code&gt;rm -rf "$DIR"/*&lt;/code&gt; when &lt;code&gt;$DIR&lt;/code&gt; is empty. Before I run anything a model handed me, it goes through a risk pass:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Scan this script for anything destructive or irreversible — deletes, overwrites, force-pushes, drops, prod credentials. For each, tell me the blast radius and a safer version (dry-run flag, confirmation prompt, backup first)."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This catches the stuff vibe energy skips: missing &lt;code&gt;set -euo pipefail&lt;/code&gt;, an unquoted variable that word-splits, a &lt;code&gt;kubectl delete&lt;/code&gt; with no namespace scoping. I keep a whole pattern for &lt;a href="https://devopsaitoolkit.com/blog/catching-risky-shell-commands-before-they-run-with-ai/" rel="noopener noreferrer"&gt;catching risky shell commands before they run&lt;/a&gt;, and another for &lt;a href="https://devopsaitoolkit.com/blog/hardening-a-bash-script-with-ai-strict-mode-traps-back-out/" rel="noopener noreferrer"&gt;hardening a bash script with strict mode, traps, and back-out paths&lt;/a&gt;. Vibe-code the first draft; harden it before it runs once.&lt;/p&gt;
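
&lt;p&gt;A cheap mechanical pre-filter can run before the model's risk pass. The sketch below is illustrative only: the pattern list is a small assumption of mine, regexes miss plenty (a &lt;code&gt;-n&lt;/code&gt; flag placed before &lt;code&gt;delete&lt;/code&gt;, for one), and the names &lt;code&gt;RISKY_PATTERNS&lt;/code&gt; and &lt;code&gt;scan_script&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import re

# Illustrative patterns only; a real risk pass needs far more than regexes.
RISKY_PATTERNS = {
    "recursive delete": re.compile(r"\brm\s+-\w*r"),
    "force push": re.compile(r"\bgit\s+push\b.*--force"),
    "kubectl delete without namespace": re.compile(
        r"\bkubectl\s+delete\b(?!.*(?:\s-n\s|--namespace))"
    ),
    "SQL drop": re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
}


def scan_script(text):
    """Return (line_number, label, line) findings; line 0 means whole-script."""
    findings = []
    lines = text.splitlines()
    if not any("set -euo pipefail" in line for line in lines):
        findings.append((0, "missing strict mode", "set -euo pipefail not found"))
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comments and the shebang
        for label, pattern in RISKY_PATTERNS.items():
            if pattern.search(stripped):
                findings.append((number, label, stripped))
    return findings


# Hypothetical model-written script used as sample input.
sample = '''#!/usr/bin/env bash
rm -rf "$DIR"/*
kubectl delete pod api-0
kubectl delete pod api-0 -n staging
git push --force origin main
'''

for number, label, line in scan_script(sample):
    print(f"line {number}: {label}: {line}")
```

&lt;p&gt;Anything this flags still goes to the model and then to a human; the point is only that the obvious hazards never depend on someone being awake enough to notice them.&lt;/p&gt;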

&lt;h2&gt;
  
  
  Rule 3: Stage everything the vibe touched
&lt;/h2&gt;

&lt;p&gt;Vibe coding tempts you to skip the boring safety rails because the loop feels so fast. Don't. The fast loop &lt;em&gt;needs&lt;/em&gt; the rails or it's just a faster way to break things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run first.&lt;/strong&gt; &lt;code&gt;kubectl apply --dry-run=server&lt;/code&gt;, &lt;code&gt;terraform plan&lt;/code&gt;, &lt;code&gt;ansible --check&lt;/code&gt;. If the tool has a no-op mode, the vibe-coded change runs there first, every time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smallest blast radius.&lt;/strong&gt; Apply to one node, one namespace, one non-prod env. Watch it. Then widen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Back-out before apply.&lt;/strong&gt; If you can't answer "how do I undo this in 60 seconds," you're not ready to apply it — no matter how confident the model sounded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this slows the vibe down much. It just moves the "oh no" moment from production to a terminal where it's free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 4: Vibe-code the toil, hand-hold the crown jewels
&lt;/h2&gt;

&lt;p&gt;Not all infrastructure is equal. I'll vibe-code a Grafana dashboard, a CI lint job, a one-off migration script, or a dev-env bootstrap with barely a glance — the downside is a wasted ten minutes. I will &lt;em&gt;not&lt;/em&gt; vibe-code an IAM policy change, a database failover, a network ACL, or anything in the path of customer money without reading every line like it's a hostile PR.&lt;/p&gt;

&lt;p&gt;Match your scrutiny to the blast radius. The model doesn't know which is which; you do. (If you're standing up an AI helper that runs alongside real systems, the same principle scales — I wrote about &lt;a href="https://devopsaitoolkit.com/blog/building-an-ai-ops-copilot-with-guardrails/" rel="noopener noreferrer"&gt;building an AI ops copilot with guardrails&lt;/a&gt; that proposes and never silently acts.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Rule 5: Keep a prompt library so your vibes are reproducible
&lt;/h2&gt;

&lt;p&gt;The dirty secret of good vibe coding is that it's not actually vibes — it's good prompts. A throwaway "fix my nginx config" gets you a throwaway answer. A prompt that says "act as a senior SRE, here's my config and the error, give me ranked causes and the &lt;code&gt;nginx -t&lt;/code&gt; to verify before I reload" gets you something you can trust.&lt;/p&gt;

&lt;p&gt;I keep the ones that work in a &lt;a href="https://devopsaitoolkit.com/prompts/" rel="noopener noreferrer"&gt;searchable prompt library&lt;/a&gt; — filterable by stack, difficulty, and whether they include production-safety guidance — so the next time I vibe-code a Postgres index or a Kubernetes rollout, I'm starting from a prompt that already bakes in the guardrails. Reproducible vibes beat lucky ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest takeaway
&lt;/h2&gt;

&lt;p&gt;Vibe coding isn't the enemy of careful engineering — it's a power tool, and power tools are exactly as safe as the operator. Used well, you draft in seconds and spend your real attention on the 5% that can actually hurt you. Used badly, you ship confident nonsense at machine speed.&lt;/p&gt;

&lt;p&gt;So: vibe the draft, read the plan, stage the apply, scope the blast radius, and keep a human's name on the change ticket. Do that and "vibe-coded" stops being a confession and starts being a workflow.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI alongside real production infrastructure at &lt;a href="https://devopsaitoolkit.com/start-here/" rel="noopener noreferrer"&gt;devopsaitoolkit.com&lt;/a&gt;. New there? The &lt;a href="https://devopsaitoolkit.com/start-here/" rel="noopener noreferrer"&gt;start-here tour&lt;/a&gt; has the free toolkit and the incident assistant. And if your production is painful enough that vibe-coding won't fix it, I do &lt;a href="https://devopsaitoolkit.com/work-with-me/" rel="noopener noreferrer"&gt;fixed-price infrastructure audits&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>devops</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
    <item>
      <title>AI for GitLab CI Authoring: Save Hours, Avoid Footguns</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sat, 20 Jun 2026 12:15:22 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/ai-for-gitlab-ci-authoring-save-hours-avoid-footguns-3lco</link>
      <guid>https://dev.to/devopsaitoolkit/ai-for-gitlab-ci-authoring-save-hours-avoid-footguns-3lco</guid>
      <description>&lt;p&gt;GitLab CI YAML is one of those formats where you can stare at it for an hour, get it 95% right, and have it fail with &lt;code&gt;yaml: line 12: did not find expected key&lt;/code&gt; because of a tab character. AI assistants are very fast at this kind of work. They're also confidently wrong about specific GitLab features in ways that waste a lot of time if you don't know what to check.&lt;/p&gt;

&lt;p&gt;After a year of letting Claude write a lot of my pipelines, here's what works and what doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI gets right consistently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard job shapes
&lt;/h3&gt;

&lt;p&gt;"Write me a job that builds a Docker image, pushes to the GitLab Container Registry, and tags with the commit SHA and &lt;code&gt;latest&lt;/code&gt; on the default branch." Type that into Claude and you get a working job in five seconds. The shape is well-established and the model has seen thousands of variations.&lt;/p&gt;

&lt;p&gt;The same is true for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test jobs across languages (pytest, jest, go test, etc.)&lt;/li&gt;
&lt;li&gt;Standard cache configurations&lt;/li&gt;
&lt;li&gt;Standard artifact patterns&lt;/li&gt;
&lt;li&gt;Basic &lt;code&gt;rules:&lt;/code&gt; for branch / tag / MR pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find yourself writing one of these from scratch, you're spending time that you don't need to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Translating from other CIs
&lt;/h3&gt;

&lt;p&gt;GitLab CI has obvious parallels to GitHub Actions, CircleCI, Jenkins declarative pipelines, etc. AI is &lt;em&gt;excellent&lt;/em&gt; at translating between them. The structures rhyme; the model knows the dictionary.&lt;/p&gt;

&lt;p&gt;If you're migrating from Actions to GitLab CI, paste the workflow and ask for the GitLab CI equivalent. You'll get something 80% right that you can refine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reviewing pipelines for inefficiency
&lt;/h3&gt;

&lt;p&gt;This is the underrated use case. Paste your &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; and ask: "what's the critical path of this pipeline, and what's making it slow?" The model will spot things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Your test job downloads node_modules from cache, but install-deps doesn't push to cache — your cache key is broken."&lt;/li&gt;
&lt;li&gt;"Your build and deploy stages are sequential but build's artifacts aren't used by deploy — they can be parallel with &lt;code&gt;needs:&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;"Your &lt;code&gt;rules:changes:&lt;/code&gt; doesn't include &lt;code&gt;package-lock.json&lt;/code&gt;, so dependency changes don't retrigger tests."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real findings I've gotten from Claude on pipelines I thought I'd already optimized. Worth the five-minute review.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI gets wrong — and how to catch it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;rules:&lt;/code&gt; vs &lt;code&gt;only/except&lt;/code&gt; confusion
&lt;/h3&gt;

&lt;p&gt;The model will sometimes mix them in the same job. GitLab doesn't allow that: a job that defines both &lt;code&gt;rules:&lt;/code&gt; and &lt;code&gt;only:&lt;/code&gt;/&lt;code&gt;except:&lt;/code&gt; fails validation with an error along the lines of "key may not be used with rules", so the pipeline never starts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Are you using &lt;code&gt;rules:&lt;/code&gt; OR &lt;code&gt;only:&lt;/code&gt;/&lt;code&gt;except:&lt;/code&gt; in each job? Pick one. (Use &lt;code&gt;rules:&lt;/code&gt; — &lt;code&gt;only/except&lt;/code&gt; is legacy.)&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;$CI_COMMIT_BRANCH&lt;/code&gt; empty on MR pipelines
&lt;/h3&gt;

&lt;p&gt;A common bug: you ask for "this job runs on the default branch" and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_BRANCH == "main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



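&lt;p&gt;This is correct for branch pipelines. But &lt;code&gt;$CI_COMMIT_BRANCH&lt;/code&gt; is &lt;strong&gt;empty&lt;/strong&gt; on MR (&lt;code&gt;merge_request_event&lt;/code&gt;) pipelines, so if you have MR pipelines enabled, your job silently won't run when developers expect it to. It also hardcodes &lt;code&gt;main&lt;/code&gt; where &lt;code&gt;$CI_DEFAULT_BRANCH&lt;/code&gt; would follow the project's actual default branch.&lt;/p&gt;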

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Does your pipeline target both push events and MR events? If so, you probably want &lt;code&gt;$CI_MERGE_REQUEST_TARGET_BRANCH_NAME&lt;/code&gt; or to handle both pipeline sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;needs:&lt;/code&gt; referencing hidden jobs
&lt;/h3&gt;

&lt;p&gt;Hidden jobs (prefixed with &lt;code&gt;.&lt;/code&gt;) are templates — they don't execute. If you do &lt;code&gt;needs: [".lint"]&lt;/code&gt;, your job will fail with a confusing error because GitLab thinks you're depending on a job that doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Every &lt;code&gt;needs:&lt;/code&gt; entry should be a real job name, not a template.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-apply rules that don't include the right branches
&lt;/h3&gt;

&lt;p&gt;The model loves writing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_BRANCH == "main"&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;never&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works on &lt;code&gt;main&lt;/code&gt; but blocks the job on tags, on MR pipelines, and on schedules that target any other branch. Sometimes that's what you want. Often it's not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; What pipeline sources do you expect this job to run in? List them, then verify your rules cover each.&lt;/p&gt;

&lt;h3&gt;
  
  
  Imaginary GitLab features
&lt;/h3&gt;

&lt;p&gt;This is the most expensive AI failure mode. The model will sometimes generate syntax for features that don't exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;condition:&lt;/code&gt; field that's actually OPA/Conftest, not GitLab CI&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;auto_retry:&lt;/code&gt; block that's GitHub Actions, not GitLab&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;before_script:&lt;/code&gt; keyword that does exist but with different semantics than the model claims&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; If you see a keyword you haven't seen before in GitLab docs, verify it exists. The project-level lint endpoint (&lt;code&gt;/api/v4/projects/:id/ci/lint&lt;/code&gt;) catches most of these, but some pass lint and just behave weirdly.&lt;/p&gt;

&lt;h2&gt;
  
  
  A workflow that catches the failures cheaply
&lt;/h2&gt;

&lt;p&gt;I now do this for any non-trivial pipeline change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Draft with AI.&lt;/strong&gt; Describe the desired behavior in plain English; let the model write the YAML.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every line.&lt;/strong&gt; Treat the output as a draft you'd write yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint via the API.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"PRIVATE-TOKEN: &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;content&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .gitlab-ci.yml | jq &lt;span class="nt"&gt;-Rs&lt;/span&gt; .&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
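       &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITLAB_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v4/projects/&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;/ci/lint"&lt;/span&gt; | jq &lt;span class="nb"&gt;.&lt;/span&gt;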
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Run on a sandbox branch.&lt;/strong&gt; Push to a branch that won't trigger deploys; verify the pipeline runs the jobs you expect, when you expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff against the existing pipeline.&lt;/strong&gt; If the AI introduced changes you didn't ask for (a different cache key, a removed &lt;code&gt;interruptible:&lt;/code&gt;), revert them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 5 is the one most people skip. The model is good at writing YAML but not at preserving your previous decisions. If you don't diff, you'll lose your old cache strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical example
&lt;/h2&gt;

&lt;p&gt;Last month I needed to add a job that runs &lt;code&gt;terraform plan&lt;/code&gt; on every MR and posts the output as a comment. Drafted with Claude in two minutes; it produced something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;terraform-plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/terraform:1.9&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform init&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform plan -out=tfplan -no-color&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform show -no-color tfplan &amp;gt; plan.txt&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;curl -X POST -H "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \&lt;/span&gt;
          &lt;span class="s"&gt;-d "body=$(cat plan.txt | jq -Rs .)" \&lt;/span&gt;
          &lt;span class="s"&gt;"$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_PIPELINE_SOURCE == "merge_request_event"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; right. Two issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
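&lt;strong&gt;&lt;code&gt;PRIVATE-TOKEN&lt;/code&gt; as a CI variable&lt;/strong&gt;: using a personal access token for CI is the old pattern, because the credential is tied to one person. A project access token with &lt;code&gt;api&lt;/code&gt; scope, stored as a masked CI/CD variable, saves rotation pain. (&lt;code&gt;$CI_JOB_TOKEN&lt;/code&gt; looks tempting, but as far as I know the merge request notes endpoint is not among the API endpoints it can call.)&lt;/li&gt;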
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;terraform init -backend-config&lt;/code&gt;&lt;/strong&gt; — works if the backend is configured in code, but if you have multiple environments using the same module, you'd want to specify which backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both fixes are 30 seconds. Without the AI I'd have spent 15 minutes writing the curl invocation alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;AI doesn't replace knowing GitLab CI. It removes the typing and the boilerplate so you can spend your attention on the parts that matter — the &lt;code&gt;rules:&lt;/code&gt; logic, the cache keys, the secrets, the environment promotion.&lt;/p&gt;

&lt;p&gt;Once you've internalized the failure modes above, the workflow becomes mostly automatic. You stop reading the boilerplate and start reading the rules. That's where the bugs live.&lt;/p&gt;

&lt;p&gt;For the prompt set we use on GitLab CI specifically, see the &lt;a href="https://dev.to/categories/gitlab-cicd/"&gt;GitLab CI/CD category&lt;/a&gt; — particularly &lt;a href="https://dev.to/prompts/gitlab-pipeline-optimization/"&gt;gitlab-pipeline-optimization&lt;/a&gt; and &lt;a href="https://dev.to/prompts/gitlab-ci-rules-debugging/"&gt;gitlab-ci-rules-debugging&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/ai-for-gitlab-ci-authoring/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gitlab</category>
      <category>cicd</category>
      <category>ai</category>
    </item>
    <item>
      <title>Humanizing Artificial Intelligence in DevOps Documentation: Making Runbooks Easier to Create and Use</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Fri, 19 Jun 2026 21:23:54 +0000</pubDate>
      <link>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-in-devops-documentation-making-runbooks-easier-to-create-and-use-4fl7</link>
      <guid>https://dev.to/jjoyneriv/humanizing-artificial-intelligence-in-devops-documentation-making-runbooks-easier-to-create-and-use-4fl7</guid>
      <description>&lt;h2&gt;
  
  
  The Runbook That Lied to Me at 3am
&lt;/h2&gt;

&lt;p&gt;The pager went off at 3:14am for a wedged OpenStack Neutron agent. I did what any tired engineer does: I opened the runbook. It told me to restart a service that had been renamed eighteen months earlier, pointed at a Grafana dashboard that 404'd, and assumed a network topology we'd migrated off of two quarters back. The runbook wasn't just unhelpful. It was actively lying to me, and I burned twenty minutes trusting it before I gave up and went to read the source.&lt;/p&gt;

&lt;p&gt;That's the real problem with documentation. It isn't that we don't write it. It's that the moment we finish writing it, it starts rotting, and the cost of keeping it fresh is high enough that nobody pays it until the document has already betrayed someone at 3am. A runbook your team doesn't trust is worse than no runbook, because no runbook at least forces you to think.&lt;/p&gt;

&lt;p&gt;This is where AI actually earns its keep in a platform org, and not in the way the marketing decks suggest. AI is not going to own your documentation. It's going to do the tedious first-draft labor — turning a resolved incident, a chunk of shell history, or a deploy diff into a structured skeleton — so a human engineer can spend their scarce attention on the part that matters: verifying the commands, marking what's unproven, and editing the robotic tone out so the team actually reads it. AI drafts. You verify and sign off. That distinction is the whole game.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Humanizing" AI Is the Job, Not a Slogan
&lt;/h2&gt;

&lt;p&gt;Let me be precise about what I mean by "humanizing AI," because the phrase gets abused. I don't mean making AI sound human to fool a reader. I mean keeping a human in the loop as the editor and owner of record, and doing the unglamorous work of turning a competent-but-soulless machine draft into something a colleague trusts.&lt;/p&gt;

&lt;p&gt;Two things break trust in AI-drafted docs, and both are fixable by a human pass:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unverified claims stated with confidence.&lt;/strong&gt; An LLM will happily tell you to run &lt;code&gt;systemctl restart neutron-l3-agent&lt;/code&gt; whether or not that's the actual unit name on your boxes. It doesn't know. It's pattern-matching. So the human's first job is to run every command, in a safe environment, and confirm it does what the draft claims.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The robotic tone.&lt;/strong&gt; Machine drafts read like a compliance memo: hedged, repetitive, weirdly formal, full of "it is important to note that." Engineers smell that instantly and stop reading. Editing for voice and concision isn't vanity — a doc people skim past doesn't get used. That cleanup pass is legitimate, valuable work, and it's exactly the "humanizing" angle that matters.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you keep those two responsibilities firmly with a person, AI becomes a force multiplier instead of a liability generator. I've written more about the philosophy of &lt;a href="https://devopsaitoolkit.com/blog/building-incident-runbooks-engineers-trust-at-3am/" rel="noopener noreferrer"&gt;runbooks engineers actually trust at 3am&lt;/a&gt;, but the short version is: trust is built by verification, and verification is a human act.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting Guides: Draft From the Incident You Just Resolved
&lt;/h2&gt;

&lt;p&gt;The best time to write a troubleshooting guide is in the hour after you've fixed the thing, while the diagnostic path is still warm in your head. The worst time is never, which is the default. AI closes that gap because the raw material already exists in your terminal.&lt;/p&gt;

&lt;p&gt;Here's the prompt pattern I use. I dump my actual shell history and the incident timeline at the model and constrain it hard:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are drafting an internal troubleshooting guide for a platform team. I'll give you my shell history and a rough incident timeline. Produce a guide with these sections: &lt;strong&gt;Symptoms&lt;/strong&gt;, &lt;strong&gt;Prerequisites/Access&lt;/strong&gt;, &lt;strong&gt;Diagnosis Steps&lt;/strong&gt;, &lt;strong&gt;Resolution&lt;/strong&gt;, &lt;strong&gt;Rollback&lt;/strong&gt;, &lt;strong&gt;Verification&lt;/strong&gt;. Use the exact commands from my history — do not invent flags or paths. Wherever you infer something I didn't explicitly state, prefix that line with &lt;code&gt;[ASSUMPTION]&lt;/code&gt; so I can verify it. Keep the prose tight; no filler.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That &lt;code&gt;[ASSUMPTION]&lt;/code&gt; instruction is the single most important line in the prompt.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Always force the model to explicitly tag anything it couldn't verify from your inputs. An LLM's failure mode is confident gap-filling. When you make it mark its own guesses, you convert an invisible risk into a visible checklist — every &lt;code&gt;[ASSUMPTION]&lt;/code&gt; tag is a thing you personally confirm before the doc ships. The tags get deleted in editing, but they drive the verification pass.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Feed that prompt a &lt;code&gt;history | grep neutron&lt;/code&gt; plus a three-line timeline, and you get back something like this — which is a genuinely good starting point, not a finished doc:&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Diagnosis Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Confirm the agent is actually down, not just flapping:
&lt;code&gt;openstack network agent list --agent-type l3 --host compute-04&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[ASSUMPTION]&lt;/code&gt; The L3 agent unit is &lt;code&gt;neutron-l3-agent&lt;/code&gt;. Check status:
&lt;code&gt;systemctl status neutron-l3-agent&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tail the agent log for the RPC timeout signature:
&lt;code&gt;journalctl -u neutron-l3-agent --since "10 min ago" | grep -i timeout&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Resolution
&lt;/h2&gt;

&lt;p&gt;Restart the agent and confirm it re-registers with the server:&lt;br&gt;
&lt;code&gt;systemctl restart neutron-l3-agent &amp;amp;&amp;amp; sleep 5 &amp;amp;&amp;amp; openstack network agent list --host compute-04&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Rollback
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;[ASSUMPTION]&lt;/code&gt; No rollback required for a restart. If the agent fails to re-register, fail the router over to a standby node before deeper debugging.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Notice what the AI did well: it structured the chaos of my history into a clean diagnostic flow and pulled the real commands I ran. Notice what it flagged: it didn't know my unit name, so it tagged it. When I verified, the unit was actually &lt;code&gt;devstack@q-l3.service&lt;/code&gt; on the affected host — a thing I'd never have caught if the draft had just stated it confidently. That one assumption tag is the difference between a guide that helps and one that lies. I go deeper on this workflow in my piece on &lt;a href="https://devopsaitoolkit.com/blog/drafting-runbooks-from-resolved-incidents-with-ai/" rel="noopener noreferrer"&gt;drafting runbooks from resolved incidents with AI&lt;/a&gt;.&lt;/p&gt;
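
&lt;p&gt;The verification pass over those tags is easy to make mechanical. Here is a minimal sketch, assuming the draft is plain text and the model followed the &lt;code&gt;[ASSUMPTION]&lt;/code&gt; convention from the prompt; the function name &lt;code&gt;verification_checklist&lt;/code&gt; and the sample draft are hypothetical.&lt;/p&gt;

```python
import re

# Matches the tag the prompt asked the model to use, plus the claim after it.
TAG = re.compile(r"\[ASSUMPTION\]\s*(.*)")


def verification_checklist(draft):
    """Return one (line_number, claim) entry per tagged line in the draft."""
    items = []
    for number, line in enumerate(draft.splitlines(), start=1):
        match = TAG.search(line)
        if match:
            items.append((number, match.group(1).strip()))
    return items


# Hypothetical excerpt of a model-written draft.
draft = """Diagnosis Steps
1. Confirm the agent is down: openstack network agent list --agent-type l3 --host compute-04
2. [ASSUMPTION] The L3 agent unit is neutron-l3-agent.
Rollback
[ASSUMPTION] No rollback required for a restart.
"""

for number, claim in verification_checklist(draft):
    print(f"[ ] line {number}: {claim}")
```

&lt;p&gt;Each printed line is one thing a person confirms on a real host before the tag comes out of the doc. An empty checklist on a long draft is itself a warning sign that the model stopped tagging, not that it stopped guessing.&lt;/p&gt;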

&lt;h2&gt;
  
  
  Postmortems: AI Handles the Timeline, You Own the Blamelessness
&lt;/h2&gt;

&lt;p&gt;Postmortems are where AI's drafting strength and its judgment weakness are both on full display. The mechanical parts — assembling a coherent timeline from Slack threads, alert timestamps, and deploy logs, then drafting an impact summary — are exactly the tedious work that delays postmortems for weeks. AI eats that for breakfast.&lt;/p&gt;

&lt;p&gt;But the &lt;em&gt;blameless&lt;/em&gt; part is not a tone setting you toggle. It's an editorial and cultural stance that a human has to own. I give the model the raw timeline and explicitly instruct it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Draft a blameless postmortem. Describe what the system did and what signals were available, never what a person "failed" to do. Frame every human action as a reasonable decision given the information available at the time. Sections: Summary, Impact, Timeline, Contributing Factors, What Went Well, Action Items. Mark any causal claim you can't support from the inputs with &lt;code&gt;[UNVERIFIED]&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model will get you 80% of the way to neutral language. The remaining 20% — catching the sentence that subtly implies the on-call engineer should have known better — is human work, every time. Blameless writing is a skill, and the editing pass is where you apply it. If you haven't internalized what separates a postmortem people read from one that gets filed and forgotten, &lt;a href="https://devopsaitoolkit.com/blog/how-to-write-a-blameless-postmortem-that-people-actually-read/" rel="noopener noreferrer"&gt;this breakdown of blameless postmortems people actually read&lt;/a&gt; is worth your time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip: Never let AI write your Action Items unsupervised. The model loves to generate plausible-sounding remediation ("add more monitoring," "improve documentation") that sounds responsible and commits no one to anything. Real action items have an owner, a due date, and a verifiable definition of done. That's a leadership decision, not a text-generation task — strike every vague item the draft produces and replace it with something a person actually agreed to do.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The payoff is real, though. When the timeline and impact summary are drafted in ten minutes instead of taking the better part of a day, postmortems actually get written while the details are fresh, instead of being abandoned because everyone moved on to the next fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  SOPs: Encode the Tribal Knowledge Before It Walks Out the Door
&lt;/h2&gt;

&lt;p&gt;Standard operating procedures are the documents most likely to live entirely in one senior engineer's head — how we rotate the cluster certs, how we drain a node for maintenance, the precise dance to cut a new Terraform workspace without orphaning state. That knowledge is a bus-factor risk, and writing it down has always lost to more urgent work.&lt;/p&gt;

&lt;p&gt;AI lowers the activation energy enough that it stops losing. I'll narrate the procedure out loud into a rough text file — half sentences, command fragments, the gotchas I remember — and have the model turn that mess into a structured SOP with numbered steps, prerequisites, and explicit "you are done when..." verification criteria.&lt;/p&gt;

&lt;p&gt;The thing to watch here is that an SOP encodes &lt;em&gt;policy&lt;/em&gt;, not just commands. The AI can format your steps beautifully and still produce something that violates your change-management rules because it doesn't know them. So the human pass on an SOP is checking two layers: are the commands correct, &lt;em&gt;and&lt;/em&gt; does the procedure comply with how we're actually supposed to operate? The model can't see your org chart or your change board. You can.&lt;/p&gt;

&lt;p&gt;I keep a small library of these drafting prompts so I'm not rewriting the scaffolding each time — collecting and reusing the &lt;a href="https://devopsaitoolkit.com/prompts/" rel="noopener noreferrer"&gt;prompts that work&lt;/a&gt; is half the productivity gain, because the quality of the draft is downstream of the quality of the prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment and Release Docs: Draft Straight From the Diff
&lt;/h2&gt;

&lt;p&gt;Deployment documentation has a unique advantage: the source of truth is structured and machine-readable. A PR diff, a Terraform plan, a Helm values change — these are precise artifacts an AI can read directly, which means the draft starts from facts rather than recollection.&lt;/p&gt;

&lt;p&gt;My workflow for release notes and deploy runbooks: pipe the diff or the &lt;code&gt;terraform plan&lt;/code&gt; output to the model and ask for a deployment guide that includes pre-flight checks, the apply procedure, the blast radius, the rollback, and the post-deploy verification. Because the input is concrete, the hallucination rate drops sharply. The model isn't guessing what changed — it's reading it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;h2&gt;
  
  
  Pre-flight Checks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Confirm the target workspace: &lt;code&gt;terraform workspace show&lt;/code&gt; returns &lt;code&gt;prod-us-east&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;[ASSUMPTION]&lt;/code&gt; This plan adds 2 nodes and modifies the ASG launch template; no destroys. Re-run &lt;code&gt;terraform plan&lt;/code&gt; and confirm zero resources to destroy before applying.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Rollback
&lt;/h2&gt;

&lt;p&gt;The launch template change is versioned. To roll back, point the ASG at the previous template version and trigger an instance refresh — do &lt;strong&gt;not&lt;/strong&gt; &lt;code&gt;terraform destroy&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even here, the human verifies the blast radius claim. "No destroys" is the kind of statement that's true until a provider upgrade quietly makes a field force-new, and a misread there is how you turn a routine deploy into an outage. The AI gets you a structured, mostly-correct draft fast; you confirm the dangerous parts with your own eyes. For a fuller treatment of wiring this into a real pipeline, my &lt;a href="https://devopsaitoolkit.com/blog/devops-runbook-automation-with-ai-2026-guide/" rel="noopener noreferrer"&gt;2026 guide to runbook automation with AI&lt;/a&gt; walks through the tooling end to end.&lt;/p&gt;
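
&lt;p&gt;The "zero destroys" check can be scripted rather than eyeballed. What follows is a minimal sketch, assuming the plan text was captured with &lt;code&gt;terraform plan -no-color&lt;/code&gt; and contains the usual &lt;code&gt;Plan: N to add, N to change, N to destroy.&lt;/code&gt; summary line. The helper names &lt;code&gt;destroy_count&lt;/code&gt; and &lt;code&gt;assert_no_destroys&lt;/code&gt; are hypothetical, and the sample lines are illustrative rather than output from a real run.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# destroy_count PLAN_TEXT (hypothetical helper)
# Prints the number from the "N to destroy." summary line, or nothing when the
# text has no such line (for example when Terraform reports no changes).
destroy_count() {
    printf '%s\n' "$1" | sed -n 's/.*Plan: .* \([0-9][0-9]*\) to destroy\..*/\1/p'
}

# assert_no_destroys PLAN_TEXT (hypothetical helper)
# Succeeds only when the summary line is present and reports zero destroys.
assert_no_destroys() {
    local count
    count="$(destroy_count "$1")"
    if [ -z "$count" ]; then
        echo "no plan summary found; not proceeding"
        return 1
    fi
    if [ "$count" -ne 0 ]; then
        echo "plan destroys $count resource(s); review before applying"
        return 1
    fi
    echo "plan summary reports 0 to destroy"
}

# Sample summary lines in the assumed format, not output from a real run.
assert_no_destroys "Plan: 2 to add, 1 to change, 0 to destroy."
assert_no_destroys "Plan: 1 to add, 0 to change, 1 to destroy." || echo "blocked"
```

&lt;p&gt;In a pipeline this would be called as &lt;code&gt;assert_no_destroys "$(terraform plan -no-color)"&lt;/code&gt;. A force-new replacement appears in that summary as one add plus one destroy, so it trips the check. The gate supplements reading the plan; it does not replace it.&lt;/p&gt;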

&lt;h2&gt;
  
  
  The Editing Pass Is Where Trust Is Built
&lt;/h2&gt;

&lt;p&gt;I want to close on the part that gets skipped, because it's the part that actually decides whether your docs get used. Once the commands are verified and the assumptions are resolved, you still have a machine-shaped document on your hands — competent, structured, and slightly lifeless. It hedges too much. It repeats the obvious. It has that flat, over-formal register that makes an engineer's eyes glaze.&lt;/p&gt;

&lt;p&gt;Editing that out is not cosmetic. A runbook nobody can stand to read is a runbook nobody reads, which puts you right back at 3am reading source code. So I do a final pass for voice: cut the filler, sharpen the warnings, add the one line of context only a human who lived the incident can add ("the agent flaps for about thirty seconds after restart — don't panic and restart it again"). That sentence is worth more than the entire generated scaffold, and only a person can write it.&lt;/p&gt;

&lt;p&gt;That's the humanizing loop in full: AI drafts the structure from real artifacts, a human verifies every command and resolves every assumption, and then a human edits the robotic tone into something a tired colleague will actually trust and follow. Keep ownership with the person at every step and AI becomes the best documentation intern you've ever had — fast, tireless, and entirely supervised. Skip the human steps and you're just generating the next lie waiting to fire at 3am.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;James Joyner IV runs &lt;a href="https://devopsaitoolkit.com/work-with-me/" rel="noopener noreferrer"&gt;devopsaitoolkit.com&lt;/a&gt;, where he writes about running production OpenStack, Kubernetes, and observability with AI in the loop. If your AI-drafted docs read like a robot wrote them, his &lt;a href="https://devopsaitoolkit.com/prompt-packs/writing-humanizer/" rel="noopener noreferrer"&gt;Writing Humanizer pack&lt;/a&gt; is a toolkit for making machine drafts read like a human actually sat down and wrote them.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>documentation</category>
      <category>sre</category>
    </item>
    <item>
      <title>Securing AI-Generated Bash Scripts Before You Run Them</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 18 Jun 2026 15:51:56 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/securing-ai-generated-bash-scripts-before-you-run-them-401m</link>
      <guid>https://dev.to/devopsaitoolkit/securing-ai-generated-bash-scripts-before-you-run-them-401m</guid>
      <description>&lt;p&gt;Bash is the easiest language for AI to write and the easiest language to get devastating output from. A 20-line script that "just cleans up old files" can recursively delete a home directory because the model assumed a variable would always be set. A "simple log shipper" can write your secrets to a remote server because the model used &lt;code&gt;set -x&lt;/code&gt; for debugging and forgot to remove it.&lt;/p&gt;

&lt;p&gt;I have run AI-generated bash that I should not have. Most engineers I know have too. After enough close calls, there's a short checklist that catches the worst of it. This is that checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five things to check before running any AI-generated bash
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Does it start with a strict pragma?
&lt;/h3&gt;

&lt;p&gt;The first lines of any non-trivial bash script should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -e&lt;/code&gt;&lt;/strong&gt; — exit on any command failure. Without this, a failure in line 5 doesn't stop the script from happily running lines 6-50.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -u&lt;/code&gt;&lt;/strong&gt; — error on undefined variables. This is the one that saves you from &lt;code&gt;rm -rf $UNDEFINED/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -o pipefail&lt;/code&gt;&lt;/strong&gt; — propagate failures through pipes. Without it, &lt;code&gt;failing-command | grep something&lt;/code&gt; succeeds because grep succeeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IFS=$'\n\t'&lt;/code&gt;&lt;/strong&gt; — sane field splitting. Defends against word-splitting bugs in filenames.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the AI-generated script doesn't have these, add them and re-read the script. You'll often discover bugs the pragma now flags.&lt;/p&gt;
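
&lt;p&gt;The effect of the flags is easy to observe in throwaway child shells. This is a small sketch, not part of any real script: &lt;code&gt;DEMO_UNSET_DIR&lt;/code&gt; is a hypothetical variable that is deliberately never assigned, and every command only echoes text, so nothing is deleted.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# DEMO_UNSET_DIR is a hypothetical variable that is deliberately never assigned.
# Each helper runs in a child bash so the flags under test stay isolated.

# Without -u the unset variable expands to nothing and the path collapses to "/".
expand_without_u() { bash -c 'echo "rm -rf ${DEMO_UNSET_DIR}/"'; }

# With -u the child shell aborts on the expansion, so echo never runs.
expand_with_u() { bash -c 'set -u; echo "rm -rf ${DEMO_UNSET_DIR}/"'; }

# A pipeline normally reports only the status of its last command.
status_without_pipefail() { bash -c 'false | cat; echo "$?"'; }

# With pipefail the failing stage is no longer masked.
status_with_pipefail() { bash -c 'set -o pipefail; false | cat; echo "$?"'; }

expand_without_u                      # prints: rm -rf /
expand_with_u || echo "set -u aborted the expansion"
status_without_pipefail               # prints: 0
status_with_pipefail                  # prints: 1
```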

&lt;h3&gt;
  
  
  2. Is every variable expansion quoted?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="nv"&gt;$TARGET_DIR&lt;/span&gt;

&lt;span class="c"&gt;# Right&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TARGET_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



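&lt;p&gt;The wrong version is what causes the "I deleted the root directory" stories. If &lt;code&gt;$TARGET_DIR&lt;/code&gt; contains a space, the command becomes &lt;code&gt;rm -rf foo bar&lt;/code&gt; and deletes two unintended things. If it is empty, a bare &lt;code&gt;rm -rf&lt;/code&gt; does nothing, but the common variant &lt;code&gt;rm -rf $TARGET_DIR/*&lt;/code&gt; becomes &lt;code&gt;rm -rf /*&lt;/code&gt;. Quoting prevents the splitting; the empty case also needs an explicit guard such as &lt;code&gt;${TARGET_DIR:?}&lt;/code&gt;.&lt;/p&gt;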

&lt;p&gt;Models default to the wrong version about half the time because the right version is harder to write in chat ("escape the quotes!") and the wrong version is what most blogs show.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; When reading AI bash, mentally check every &lt;code&gt;$VAR&lt;/code&gt; for quotes. Add them if missing. This is the single biggest source of bash disasters.&lt;/p&gt;
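
&lt;p&gt;Counting arguments makes the difference concrete. In this sketch &lt;code&gt;count_args&lt;/code&gt; is a hypothetical helper and the directory name is invented; the default &lt;code&gt;IFS&lt;/code&gt; is assumed, since that is what an unhardened script runs with.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_args is a hypothetical helper: it reports how many arguments it received.
count_args() { echo "$#"; }

TARGET_DIR="old backups"      # one directory name that contains a space
EMPTY_DIR=""                  # set, but empty

# Default IFS is assumed here; the IFS line from the pragma would stop the
# split on spaces but not on newlines or tabs.
count_args $TARGET_DIR        # prints 2: the name was split into two words
count_args "$TARGET_DIR"      # prints 1: the name arrives intact
count_args $EMPTY_DIR         # prints 0: the argument vanished entirely
count_args "$EMPTY_DIR"       # prints 1: an empty string is still an argument
```

&lt;p&gt;The third call is the dangerous one: a command that expected a path receives nothing at all, and whatever follows the variable on the line becomes the target.&lt;/p&gt;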

&lt;h3&gt;
  
  
  3. What happens if a step fails partway through?
&lt;/h3&gt;

&lt;p&gt;The AI will cheerfully write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/new-app
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/new-app
&lt;span class="nb"&gt;tar &lt;/span&gt;xzf &lt;span class="nv"&gt;$TARBALL&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nv"&gt;$TARBALL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens if &lt;code&gt;tar xzf&lt;/code&gt; fails (corrupt tarball, full disk)? With &lt;code&gt;set -e&lt;/code&gt;, the script stops. Good. Without &lt;code&gt;set -e&lt;/code&gt;, it continues to &lt;code&gt;rm $TARBALL&lt;/code&gt; and deletes your tarball with no backup.&lt;/p&gt;

&lt;p&gt;For any state-changing script, ask yourself: at each step, what's the recovery path if the step fails? If the answer is "nothing automated," the script should at least &lt;em&gt;not delete data&lt;/em&gt; before verifying the previous step succeeded.&lt;/p&gt;

&lt;p&gt;The AI almost never thinks about this on its own.&lt;/p&gt;
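
&lt;p&gt;One way to encode "don't delete before verifying" is to make the removal conditional on the extraction itself. A minimal sketch, assuming GNU or BSD &lt;code&gt;tar&lt;/code&gt;; &lt;code&gt;install_tarball&lt;/code&gt; is a hypothetical helper and the paths in the usage comment are placeholders.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# install_tarball TARBALL DEST (hypothetical helper)
# Extracts TARBALL into DEST and removes the archive only after tar succeeds.
install_tarball() {
    local tarball="$1" dest="$2"
    mkdir -p "$dest"
    if tar -xzf "$tarball" -C "$dest"; then
        rm -- "$tarball"
        echo "extracted into $dest, removed $tarball"
    else
        # Corrupt archive or full disk: keep the tarball so the run can be retried.
        echo "extraction failed, keeping $tarball"
        return 1
    fi
}

# Usage (placeholder paths):
#   install_tarball /tmp/app.tgz /opt/new-app
```

&lt;p&gt;Tying the delete to the result of &lt;code&gt;tar&lt;/code&gt; keeps the script safe even if someone later removes &lt;code&gt;set -e&lt;/code&gt;.&lt;/p&gt;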

&lt;h3&gt;
  
  
  4. Are secrets visible in logs?
&lt;/h3&gt;

&lt;p&gt;The most common way AI-generated bash leaks secrets is via &lt;code&gt;set -x&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# debugging&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://api.example.com/...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;set -x&lt;/code&gt;, every command is printed including the expanded variables. Your API token is now in the script's output, which is in your CI logs, which are visible to anyone with project access.&lt;/p&gt;

&lt;p&gt;The fix is selective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; +x  &lt;span class="c"&gt;# disable trace&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://api.example.com/...
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# re-enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or simply remove &lt;code&gt;set -x&lt;/code&gt; once debugging is done. The model frequently leaves it in.&lt;/p&gt;
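
&lt;p&gt;When tracing has to stay on for the rest of the script, the secret-bearing call can be isolated in a function that expands the token only while tracing is off. This is a sketch under assumptions: &lt;code&gt;api_get&lt;/code&gt; is a hypothetical helper, and the token is assumed to live in an &lt;code&gt;API_TOKEN&lt;/code&gt; environment variable.&lt;/p&gt;

```shell
#!/usr/bin/env bash
set -euo pipefail

# api_get URL (hypothetical helper; reads the token from $API_TOKEN)
# The caller passes only the URL, so the traced call line never contains the
# token. The token is expanded inside the function after tracing is disabled.
api_get() {
    local url="$1" was_tracing=0 status=0
    case "$-" in *x*) was_tracing=1 ;; esac    # remember whether -x was active
    set +x
    curl -fsS -H "Authorization: Bearer ${API_TOKEN}" "$url" || status=$?
    if [ "$was_tracing" -eq 1 ]; then set -x; fi   # restore tracing for the caller
    return "$status"
}

# Usage (placeholder URL):
#   api_get https://api.example.com/status
```

&lt;p&gt;Passing the token as an argument to a wrapper would defeat the purpose, because the wrapper call itself is traced with its expanded arguments before the function body runs.&lt;/p&gt;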

&lt;h3&gt;
  
  
  5. Does it run as root unnecessarily?
&lt;/h3&gt;

&lt;p&gt;The AI will sometimes write &lt;code&gt;sudo&lt;/code&gt; into every command, even ones that don't need it. Or it'll assume the script runs as root and use absolute paths that require root to write.&lt;/p&gt;

&lt;p&gt;The principle: if a command can run as a non-root user, it should. The smaller the privileged surface, the smaller the blast radius.&lt;/p&gt;

&lt;p&gt;This is especially important for scripts that download and execute code. A common pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dangerous: privileged download + execute&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'curl https://example.com/install.sh | bash'&lt;/span&gt;

&lt;span class="c"&gt;# Safer: review then run&lt;/span&gt;
curl https://example.com/install.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; install.sh
&lt;span class="c"&gt;# READ install.sh&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bash install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the model generates the first pattern, replace it with the second. Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Last month I asked Claude to write a script that cleans up Docker images older than 30 days on a CI runner host. The first draft was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;DOCKER_IMAGES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;docker images &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.ID}} {{.CreatedAt}}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CUTOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DOCKER_IMAGES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read &lt;/span&gt;ID DATE&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$CREATED&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="nv"&gt;$CUTOFF&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;docker rmi &lt;span class="nv"&gt;$ID&lt;/span&gt;
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walking the checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No strict pragma.&lt;/strong&gt; Missing &lt;code&gt;set -euo pipefail&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unquoted &lt;code&gt;$CREATED&lt;/code&gt;, &lt;code&gt;$CUTOFF&lt;/code&gt;, and &lt;code&gt;$ID&lt;/code&gt;.&lt;/strong&gt; Each one is a potential bug, and &lt;code&gt;read&lt;/code&gt; without &lt;code&gt;-r&lt;/code&gt; mangles backslashes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure handling.&lt;/strong&gt; &lt;code&gt;docker rmi&lt;/code&gt; fails if an image is in use. The script continues, marches through, and silently fails on every in-use image. We never know which were cleaned and which weren't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secrets&lt;/strong&gt; (docker doesn't expose them here), but the script also doesn't log what it's doing, so you can't audit afterward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;sudo&lt;/code&gt;,&lt;/strong&gt; good — assumes the user has Docker socket access, which is reasonable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The hardened version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;

&lt;span class="nv"&gt;CUTOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;REMOVED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="nv"&gt;SKIPPED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Feed the loop via process substitution so it runs in this shell;&lt;/span&gt;
&lt;span class="c"&gt;# a "docker images | while" pipeline would count inside a subshell and report 0/0.&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'|'&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; ID DATE&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CUTOFF&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        if &lt;/span&gt;docker rmi &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Removed: &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nv"&gt;REMOVED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;REMOVED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Skipped (in use): &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nv"&gt;SKIPPED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;SKIPPED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;fi
    fi
done&lt;/span&gt; &amp;lt; &amp;lt;(docker images &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.ID}}|{{.CreatedAt}}'&lt;/span&gt;)

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Cleanup complete. Removed: &lt;/span&gt;&lt;span class="nv"&gt;$REMOVED&lt;/span&gt;&lt;span class="s2"&gt;, Skipped: &lt;/span&gt;&lt;span class="nv"&gt;$SKIPPED&lt;/span&gt;&lt;span class="s2"&gt;."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took two minutes of editing. Without the checklist, I might have run the original and noticed days later that disk usage hadn't really dropped because half the images were in use.&lt;/p&gt;

&lt;h2&gt;
  
  
  A small note on bash linting
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;shellcheck&lt;/code&gt; catches most of these issues automatically. If you adopt one tool from this article, make it shellcheck:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shellcheck cleanup-images.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



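&lt;p&gt;It will flag unquoted variables, &lt;code&gt;read&lt;/code&gt; without &lt;code&gt;-r&lt;/code&gt;, variables modified inside a pipeline subshell, and dozens of other patterns. It does not check for a missing strict pragma, so that item stays on the manual checklist. AI-generated bash usually has at least one shellcheck warning.&lt;/p&gt;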

&lt;p&gt;I now run shellcheck on every script before I run the script itself. It's two seconds and catches things I'd miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the AI gets it right
&lt;/h2&gt;

&lt;p&gt;To be fair: the model is often perfectly capable of producing safe bash. If you prompt it explicitly — "write this with &lt;code&gt;set -euo pipefail&lt;/code&gt;, quote every variable, fail loudly on errors" — you'll get a clean script.&lt;/p&gt;

&lt;p&gt;The problem is that "write me a script that does X" without that prompt gets you the &lt;em&gt;common&lt;/em&gt; form of the script, which is the unsafe form. So the rule of thumb:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always include the safety requirements in the prompt.&lt;/strong&gt; Or: always treat the output as a draft that needs hardening. Don't run any bash the AI wrote without one of those two disciplines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Bash from AI is fast to produce and easy to read incorrectly. The checklist is short (strict pragma, quoted expansions, failure paths, secrets in logs, unnecessary privilege), and applying it takes a couple of minutes per script. The cost of skipping it ranges from "minor cleanup mistake" to "career incident." There's no excuse not to do the check.&lt;/p&gt;

&lt;p&gt;For bash-specific prompts, see &lt;a href="https://dev.to/prompts/bash-script-code-review/"&gt;bash-script-code-review&lt;/a&gt; and the companion &lt;a href="https://dev.to/prompts/linux-server-hardening/"&gt;linux-server-hardening&lt;/a&gt; prompt.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/securing-ai-generated-bash-scripts/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>bash</category>
      <category>security</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Best AI Tools for DevOps Engineers in 2026</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Wed, 17 Jun 2026 20:59:44 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/the-best-ai-tools-for-devops-engineers-in-2026-15a9</link>
      <guid>https://dev.to/devopsaitoolkit/the-best-ai-tools-for-devops-engineers-in-2026-15a9</guid>
      <description>&lt;p&gt;If you spend your day in a terminal, a YAML editor, or a Grafana tab, AI assistants in 2026 are no longer a curiosity. They're a real productivity layer. But not every tool is good at infrastructure work. After a year of daily use across Linux administration, OpenStack operations, Prometheus alert authoring, and Kubernetes debugging, here's the honest shortlist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The criteria
&lt;/h2&gt;

&lt;p&gt;We're not ranking on benchmark scores. We're ranking on &lt;strong&gt;infrastructure usefulness&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning over command output&lt;/strong&gt; — can it actually read &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;kubectl describe&lt;/code&gt;, or &lt;code&gt;journalctl&lt;/code&gt; and find the real problem?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; — does it warn before suggesting destructive commands?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long context&lt;/strong&gt; — can it hold a 1,000-line &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; plus failing logs without losing track?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal integration&lt;/strong&gt; — can you use it without leaving your workflow?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and self-host options&lt;/strong&gt; — for the engineers whose employers care.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The shortlist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Claude (Anthropic)
&lt;/h3&gt;

&lt;p&gt;The current best general assistant for infrastructure reasoning. Long context handles enormous log dumps and Kubernetes manifests in one shot. It is consistently more cautious about destructive commands than alternatives — which matters when you're tired at 2am and tempted to copy-paste straight into prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Linux/OpenStack/Kubernetes troubleshooting, postmortem drafting, code review on infrastructure-as-code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. ChatGPT (OpenAI)
&lt;/h3&gt;

&lt;p&gt;The broadest ecosystem. Strong code generation, plug-in support, and the largest community of shared prompts and patterns. For Ansible and Terraform generation, output quality is excellent. Slightly less cautious by default — you'll want to add safety constraints in your prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Ansible/Terraform generation, ad-hoc scripting, learning new tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cursor
&lt;/h3&gt;

&lt;p&gt;If you live in an IDE, Cursor is what your IDE should have been. Native multi-file context, agent mode for repo-wide refactors, and tab-completion that actually understands your codebase. Especially strong for IaC repositories with many interconnected files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Editing real codebases (Helm charts, Terraform modules, Python operators).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. GitHub Copilot
&lt;/h3&gt;

&lt;p&gt;The lowest-friction option. Inline completion just works, and the chat sidebar is genuinely useful for "explain this regex" or "what's this PromQL doing?" If your org's GitHub plan already includes Copilot seats, turning it on is essentially free upside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Inline completion while editing YAML, Bash, Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Warp Terminal (with AI features)
&lt;/h3&gt;

&lt;p&gt;The only entry on this list that isn't an AI assistant per se — it's a terminal that has AI built in. The killer feature: natural-language command suggestions in your shell, with safety previews. For Linux admins who don't want to alt-tab to a chat window every five seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Terminal-native workflows where context-switching kills focus.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we don't recommend (yet)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic LLM wrappers that promise "DevOps AI."&lt;/strong&gt; Most are thin layers over the same APIs above, sometimes with worse safety defaults. Use the underlying tools directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything that requires uploading your &lt;code&gt;~/.ssh&lt;/code&gt; directory or production credentials.&lt;/strong&gt; Be skeptical of "AI agents that run commands for you" without a clear sandbox model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to combine them
&lt;/h2&gt;

&lt;p&gt;A pattern that works well in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claude or ChatGPT in a browser&lt;/strong&gt; for deep diagnosis sessions (paste logs, walk through hypotheses, draft postmortems).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor or Copilot in your editor&lt;/strong&gt; for actually writing the fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp&lt;/strong&gt; in the terminal for quick command lookups without switching context.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need one perfect tool. You need a workflow where each tool plays to its strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/claude-linux-troubleshooting/"&gt;How to Use Claude to Troubleshoot Linux Servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/chatgpt-vs-claude-for-infrastructure/"&gt;ChatGPT vs Claude for Infrastructure Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/best-ai-tools-for-devops-engineers/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>tools</category>
      <category>claude</category>
    </item>
  </channel>
</rss>
