<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevOps AI ToolKit</title>
    <description>The latest articles on DEV Community by DevOps AI ToolKit (devopsaitoolkit).</description>
    <link>https://dev.to/devopsaitoolkit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F13604%2F98511a3b-3821-49c1-b918-dca37eae0c17.png</url>
      <title>DEV Community: DevOps AI ToolKit</title>
      <link>https://dev.to/devopsaitoolkit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devopsaitoolkit"/>
    <language>en</language>
    <item>
      <title>DevOps as a Service Pricing: What Should Businesses Expect to Pay?</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Mon, 29 Jun 2026 22:02:07 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/devops-as-a-service-pricing-what-should-businesses-expect-to-pay-2481</link>
      <guid>https://dev.to/devopsaitoolkit/devops-as-a-service-pricing-what-should-businesses-expect-to-pay-2481</guid>
      <description>&lt;p&gt;After 25 years of keeping production systems alive — building the automation, owning the pager, and helping companies stop bleeding money on preventable outages — the question I get asked most by founders and operations leads is blunt: &lt;em&gt;"What is this going to cost me?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The honest answer is the one nobody likes: it depends. But "it depends" isn't useful if you're trying to budget. So let me give you the real version — what drives the number, the pricing models you'll actually be quoted, and a simple way to figure out whether the spend pays for itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DevOps pricing varies so much
&lt;/h2&gt;

&lt;p&gt;There's no sticker price on DevOps for the same reason there's no sticker price on "fixing my house." A one-bedroom condo and a 40-year-old farmhouse are different jobs. Three things move the number more than anything else:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Company size.&lt;/strong&gt; A two-person startup with one Linux server and a single web app is a fundamentally different engagement than a 200-person company running multiple Kubernetes clusters across regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure complexity.&lt;/strong&gt; A static site on a single cloud VM is cheap to run. A microservices platform with service meshes, multiple databases, message queues, and compliance requirements is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support expectations.&lt;/strong&gt; "Help us when something breaks during business hours" and "24/7 on-call with a 15-minute response SLA" are priced an order of magnitude apart, because one of them owns someone's nights and weekends.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before you can compare quotes, you have to be honest about which of those buckets you're actually in. A provider quoting you a low number may simply be assuming a smaller scope than the one you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  The common pricing models
&lt;/h2&gt;

&lt;p&gt;Most DevOps as a Service work is sold under one of five models. Each fits a different situation, and good providers will steer you toward the right one rather than forcing everything into their favorite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hourly / time-and-materials
&lt;/h3&gt;

&lt;p&gt;You pay for hours worked, usually billed against a monthly cap.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; Small, well-defined tasks, ad-hoc help, or an early relationship where neither side knows the full scope yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality check:&lt;/strong&gt; Rates vary widely by region and seniority, so the hourly number alone tells you little. The trap is that hourly billing rewards activity, not outcomes: a cheap hourly rate from someone who takes three times as long is not a bargain.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monthly retainer
&lt;/h3&gt;

&lt;p&gt;A fixed monthly fee buys you a block of capacity and ongoing ownership of your infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; You have living infrastructure that needs continuous care — patching, monitoring, upgrades, small improvements — and you want a predictable line item.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Ongoing Kubernetes version upgrades, Prometheus and Grafana tuning, and routine Ansible-driven patching of your Linux fleet are classic retainer work. The cluster doesn't stop needing attention, so neither does the engagement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Project-based / fixed bid
&lt;/h3&gt;

&lt;p&gt;A scoped deliverable for a fixed price.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; A clear, bounded build with a defined "done."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; A one-time Terraform plus GitLab CI/CD build-out — provision the cloud accounts, write the infrastructure as code, stand up the pipelines, Dockerize the apps, and hand it over — is naturally project-priced. You know what you're getting and what it costs before work starts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Emergency / incident support
&lt;/h3&gt;

&lt;p&gt;On-demand help when production is on fire, often at a premium rate or via a pre-paid response retainer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; You run your own systems day-to-day but want a number to call when something serious breaks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality check:&lt;/strong&gt; This is the most expensive way to buy help per hour, because you're paying for someone to drop everything. It's insurance, not a maintenance plan — and it's far cheaper to prevent the incident than to buy emergency labor mid-outage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fully managed service
&lt;/h3&gt;

&lt;p&gt;The provider owns your DevOps function end to end — infrastructure, pipelines, monitoring, security, on-call, the lot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;When it fits:&lt;/strong&gt; You don't want to hire and retain an internal platform team, or you want to extend the small one you have.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reality check:&lt;/strong&gt; This is the highest monthly spend, but compare it against the loaded cost of hiring senior engineers, the recruiting time, and the bus-factor risk of a one-person internal team. Often it's cheaper and far less fragile than building the same capability in-house.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A healthy engagement often mixes models: a project-priced initial build-out, then a monthly retainer to run what was built.&lt;/p&gt;

&lt;h2&gt;
  
  
  What services actually move the price
&lt;/h2&gt;

&lt;p&gt;Within any model, the scope of work is what sets the number. The big cost factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud setup and infrastructure as code.&lt;/strong&gt; Account structure, networking, and Terraform modules to make it all reproducible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD pipelines.&lt;/strong&gt; Building and maintaining GitLab CI/CD (or equivalent) so deploys are fast, repeatable, and safe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containers and orchestration.&lt;/strong&gt; Docker images, registries, and Kubernetes — the single biggest complexity multiplier in modern infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and observability.&lt;/strong&gt; Prometheus, Grafana, alerting rules, and dashboards. Good &lt;a href="https://dev.to/dashboard/monitoring-alerts/"&gt;monitoring and alert generation&lt;/a&gt; is what turns a 3am outage into a 9am ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security.&lt;/strong&gt; Secrets management, access control, network policy, vulnerability scanning, and hardening of your Linux servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups and disaster recovery.&lt;/strong&gt; Tested restores — not just backups that exist on paper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response and on-call.&lt;/strong&gt; The cost of someone being awake and accountable when things go wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation.&lt;/strong&gt; Ansible playbooks and scripting that replace manual, error-prone toil.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance.&lt;/strong&gt; SOC 2, HIPAA, PCI, and friends add audit, documentation, and control work that materially raises cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The more of these you need, and the higher the stakes, the higher the price. That's not padding — it's the actual work of keeping a real system running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why cheaper is not always better
&lt;/h2&gt;

&lt;p&gt;Here's where my experience makes me opinionated: in production infrastructure, the cheapest quote is frequently the most expensive decision.&lt;/p&gt;

&lt;p&gt;A low bid usually means one of a few things — a junior engineer learning on your dime, a scope that quietly excludes monitoring or backups, or a contractor who'll bolt something together and disappear before the technical debt comes due. You don't find out until the pipeline breaks at the worst possible moment, the backups turn out to be untested, or a security gap becomes an incident.&lt;/p&gt;

&lt;p&gt;Infrastructure is one of those areas where you're not buying hours — you're buying the absence of disasters. That's hard to see on an invoice and very easy to feel in an outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What downtime actually costs
&lt;/h2&gt;

&lt;p&gt;This is the framing that changes the conversation. Put a number on downtime and the "expensive" DevOps quote suddenly looks like a rounding error.&lt;/p&gt;

&lt;p&gt;A simple cost-of-downtime model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Downtime cost per hour = (Annual revenue / Business hours per year) + recovery labor + reputation/churn cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Work a concrete example. Say a business does &lt;strong&gt;$5,000,000&lt;/strong&gt; in revenue a year and runs roughly &lt;strong&gt;3,000 business hours&lt;/strong&gt;. That's about &lt;strong&gt;$1,667 per hour&lt;/strong&gt; in direct lost revenue — before you add the engineers pulled off roadmap work to firefight, the customers who churn, and the support load from a public incident. Call it &lt;strong&gt;$2,500–$4,000 an hour&lt;/strong&gt;, conservatively.&lt;/p&gt;
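
&lt;p&gt;The model above is simple enough to sketch in a few lines of Python. This is a minimal illustration, not a costing tool: the function name and the per-hour recovery and churn parameters are assumptions made for the example, and the add-on values in the second call are invented to show how the pieces combine.&lt;/p&gt;

```python
def downtime_cost_per_hour(annual_revenue, business_hours_per_year,
                           recovery_labor_per_hour=0.0,
                           churn_cost_per_hour=0.0):
    """Hourly downtime cost: direct lost revenue plus the softer costs."""
    direct_revenue = annual_revenue / business_hours_per_year
    return direct_revenue + recovery_labor_per_hour + churn_cost_per_hour


# Direct lost revenue only, using the figures from the example above.
direct = downtime_cost_per_hour(5_000_000, 3_000)
print(f"Direct lost revenue: ${direct:,.0f}/hour")

# Hypothetical add-ons for firefighting labor and churn (assumed values).
loaded = downtime_cost_per_hour(5_000_000, 3_000,
                                recovery_labor_per_hour=600,
                                churn_cost_per_hour=400)
print(f"Loaded estimate: ${loaded:,.0f}/hour")
```

&lt;p&gt;Swap in your own revenue, hours, and labor figures; the structure of the sum matters more than the sample numbers.&lt;/p&gt;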

&lt;p&gt;Now consider what causes that downtime in shops without proper DevOps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failed deployments&lt;/strong&gt; with no pipeline safeguards or rollback — a bad release that takes hours to unwind.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor monitoring&lt;/strong&gt; that means you learn about the outage from angry customers instead of an alert, which can easily add 30 minutes or more of pure detection delay to an incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual, undocumented processes&lt;/strong&gt; where only one person knows how to restore the service, and they're on vacation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single multi-hour outage can cost more than a year of competent monitoring and incident-response coverage. The DevOps spend isn't competing with zero — it's competing with the outages it prevents.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI changes the math
&lt;/h2&gt;

&lt;p&gt;Part of why DevOps value-for-money has improved is that AI now removes a large slice of the repetitive labor that used to fill the bill.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drafting and reviewing infrastructure as code.&lt;/strong&gt; Terraform and Ansible scaffolding that used to take hours gets drafted in minutes, then reviewed by a human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline and config generation.&lt;/strong&gt; GitLab CI/CD configs, Dockerfiles, and Kubernetes manifests start from a solid AI-generated baseline instead of a blank file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring setup.&lt;/strong&gt; Generating sensible Prometheus alert rules and Grafana panels — historically tedious, easily templated work — is far faster with AI assistance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident triage.&lt;/strong&gt; Summarizing logs and correlating "what changed" compresses the slow part of an outage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key word is &lt;em&gt;assisted&lt;/em&gt; — a human still owns every change to production. But a provider using AI well can deliver more per dollar, which means you get broader coverage for the same budget. If you want to see the kind of work this accelerates, our &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; shows the patterns we lean on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Starting lean: startups and small businesses
&lt;/h2&gt;

&lt;p&gt;If you're early-stage, you do not need a fully managed enterprise engagement, and you shouldn't pay for one. Start with a lean package that covers the essentials and nothing you won't use yet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A reproducible cloud setup with Terraform, so you're never clicking around a console by hand.&lt;/li&gt;
&lt;li&gt;One clean CI/CD pipeline so deploys are boring and repeatable.&lt;/li&gt;
&lt;li&gt;Basic monitoring and alerting on the handful of metrics that actually predict outages.&lt;/li&gt;
&lt;li&gt;Tested backups.&lt;/li&gt;
&lt;li&gt;A documented runbook so recovery doesn't depend on one person's memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a modest retainer or a small fixed-bid build-out, and it removes the failure modes that sink small companies. You add Kubernetes, deeper observability, and compliance work later — when the business actually needs them, not before. You can see how we structure tiers like this on our &lt;a href="https://dev.to/pricing"&gt;pricing&lt;/a&gt; page.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to calculate ROI
&lt;/h2&gt;

&lt;p&gt;Don't buy DevOps on vibes. Run the numbers. A usable formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROI (%) = ((Value gained - Cost of service) / Cost of service) x 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;strong&gt;value gained&lt;/strong&gt; is the sum of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Downtime avoided&lt;/strong&gt; — fewer outage hours × your cost-of-downtime-per-hour.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering time reclaimed&lt;/strong&gt; — hours your developers stop spending on infrastructure toil, at their loaded cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster delivery&lt;/strong&gt; — features shipped sooner because the pipeline is fast and reliable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incidents prevented&lt;/strong&gt; — the emergency-rate firefighting you never have to buy.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A worked example. Suppose a managed engagement costs &lt;strong&gt;$60,000 a year&lt;/strong&gt;. Over that year it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents an estimated &lt;strong&gt;20 hours&lt;/strong&gt; of downtime at &lt;strong&gt;$3,000/hour&lt;/strong&gt; = &lt;strong&gt;$60,000&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Frees &lt;strong&gt;two developers&lt;/strong&gt; from ~5 hours/week of infra work — roughly &lt;strong&gt;$50,000&lt;/strong&gt; of reclaimed engineering time.&lt;/li&gt;
&lt;li&gt;Speeds delivery enough to pull in revenue you'd otherwise have deferred — call it &lt;strong&gt;$30,000&lt;/strong&gt;, conservatively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's &lt;strong&gt;$140,000&lt;/strong&gt; of value against &lt;strong&gt;$60,000&lt;/strong&gt; of cost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROI = (($140,000 - $60,000) / $60,000) x 100 = ~133%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even if you halve every one of those estimates to be safe, you still come out ahead: $70,000 of value against $60,000 of cost is an ROI of about 17%. The exercise matters more than the exact figures. When you actually price the downtime you avoid and the time you reclaim, good DevOps consistently pays for itself.&lt;/p&gt;
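
&lt;p&gt;To rerun the arithmetic with your own numbers, here is a minimal Python sketch of the ROI formula. The function and variable names are assumptions for the example; the inputs are the illustrative figures from the worked example, plus the "halve everything" sanity check.&lt;/p&gt;

```python
def roi_percent(value_gained, cost_of_service):
    """ROI (%) = ((value gained - cost of service) / cost of service) x 100."""
    return (value_gained - cost_of_service) / cost_of_service * 100


# Value components from the worked example (illustrative figures).
value_components = {
    "downtime_avoided": 20 * 3_000,       # 20 hours at $3,000/hour
    "engineering_time_reclaimed": 50_000,
    "faster_delivery": 30_000,
}
cost = 60_000

full_value = sum(value_components.values())
halved_value = full_value / 2             # every estimate cut in half

print(f"As estimated: {roi_percent(full_value, cost):.0f}%")
print(f"Halved:       {roi_percent(halved_value, cost):.0f}%")
```

&lt;p&gt;Keeping the value components separate makes it easy to see which estimate the result is most sensitive to.&lt;/p&gt;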

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;DevOps as a Service pricing genuinely varies, and any provider who hands you a flat number without understanding your systems is guessing. But the framework is straightforward: know which size and complexity bucket you're in, pick the pricing model that fits the work, scope the services you actually need, and run the ROI math against the very real cost of doing nothing.&lt;/p&gt;

&lt;p&gt;The mistake I see most often is treating DevOps as a cost line to minimize. It isn't. It's an investment in uptime, delivery speed, security, and the ability to scale without setting your infrastructure on fire. Price it against the outages, the lost engineering hours, and the deals you can't close because the platform won't hold — and the question stops being "what does this cost?" and becomes "what is it costing me not to have it?"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cost figures and ranges here are illustrative. Build your own estimate from your real revenue, infrastructure, and risk profile before committing to a budget.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/devops-as-a-service-pricing-what-to-expect/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>pricing</category>
      <category>manageddevops</category>
      <category>roi</category>
    </item>
    <item>
      <title>Best DevSecOps Security Tools for CI/CD Pipeline Protection</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sun, 28 Jun 2026 05:05:55 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/best-devsecops-security-tools-for-cicd-pipeline-protection-25ff</link>
      <guid>https://dev.to/devopsaitoolkit/best-devsecops-security-tools-for-cicd-pipeline-protection-25ff</guid>
      <description>&lt;p&gt;I've spent twenty-five years building and securing deployment pipelines, and the single biggest shift in that time isn't a tool — it's &lt;em&gt;where&lt;/em&gt; security lives. We used to bolt it on at the end, right before a release, when changing anything was expensive and everyone was already tired. That's backwards. DevSecOps is the correction: you move security checks left, into the pipeline, so problems surface when they're cheap to fix and the person who introduced them is still looking at the code.&lt;/p&gt;

&lt;p&gt;This is a tour of the tool &lt;em&gt;categories&lt;/em&gt; that matter, with representative (mostly open-source) examples and where each one fits in a real GitLab CI/CD or GitHub Actions pipeline. It is not a ranked ad. The right toolchain depends on your team size and how mature your infrastructure is, and I'll come back to that at the end.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevSecOps actually means
&lt;/h2&gt;

&lt;p&gt;DevSecOps is "shift-left security": treating security as a property of the pipeline, not a gate at the end of it. Concretely, it means your CI runs the same checks a security reviewer would — scanning code, dependencies, containers, infrastructure definitions, and secrets — &lt;em&gt;automatically, on every push&lt;/em&gt;, and fails the build when it finds something that genuinely matters.&lt;/p&gt;

&lt;p&gt;The goal isn't to drown developers in findings. It's to catch the dangerous classes of mistakes early and consistently, so security becomes muscle memory instead of a quarterly fire drill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the CI/CD pipeline is a prime target
&lt;/h2&gt;

&lt;p&gt;Your pipeline is the most privileged thing in your engineering org and the least watched. It has credentials to your cloud, your registry, and production. That makes it a high-value target in ways teams routinely underestimate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Supply-chain attacks.&lt;/strong&gt; A compromised dependency or a malicious GitHub Action runs &lt;em&gt;inside&lt;/em&gt; your build with your secrets in the environment. SolarWinds and the Codecov breach were both pipeline-level compromises, not application bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets sprawl.&lt;/strong&gt; Pipelines are where API keys, cloud credentials, and registry tokens live. A leaked CI variable or a key hardcoded in a &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; is a direct path to your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner trust.&lt;/strong&gt; Shared or self-hosted runners that build untrusted code can be poisoned. A pull request from a fork that triggers a privileged job is a classic foothold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact tampering.&lt;/strong&gt; If nothing verifies that the image you deploy is the image you built, an attacker who can write to your registry can swap it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every category below is a defense against one or more of these.&lt;/p&gt;

&lt;h2&gt;
  
  
  SAST — Static Application Security Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans your source code without running it, looking for vulnerable patterns — SQL injection, command injection, unsafe deserialization, hardcoded crypto mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; early, on every merge request, as a fast job that runs before anything gets built.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://semgrep.dev" rel="noopener noreferrer"&gt;Semgrep&lt;/a&gt; is my default — it's open-source, fast, and its rules read like the code they match, so writing a custom rule for your own footguns takes minutes. GitLab ships a built-in &lt;a href="https://docs.gitlab.com/ee/user/application_security/sast/" rel="noopener noreferrer"&gt;SAST&lt;/a&gt; analyzer you can enable with a single &lt;code&gt;include&lt;/code&gt; in your &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;. For Python-specific work, Bandit is a lightweight option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; run SAST in &lt;em&gt;diff&lt;/em&gt; mode on merge requests so it only flags new findings. Nothing kills adoption faster than a first run that reports 4,000 pre-existing issues and blocks everyone. Baseline the legacy debt, enforce on new code.&lt;/p&gt;
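
&lt;p&gt;The "baseline the debt, enforce on new code" idea can be sketched in Python. This is only an illustration of the logic: the finding shape (&lt;code&gt;rule&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;, &lt;code&gt;fingerprint&lt;/code&gt;) is hypothetical and is not the report format of Semgrep or GitLab SAST, both of which have their own ways of handling baselines.&lt;/p&gt;

```python
def new_findings(current, baseline):
    """Return findings in the current scan that are absent from the baseline."""
    known = {(f["rule"], f["path"], f["fingerprint"]) for f in baseline}
    return [f for f in current
            if (f["rule"], f["path"], f["fingerprint"]) not in known]


# Hypothetical findings; the dict keys are invented for this example.
baseline = [
    {"rule": "sql-injection", "path": "legacy/report.py", "fingerprint": "a1"},
]
current = baseline + [
    {"rule": "command-injection", "path": "api/export.py", "fingerprint": "b2"},
]

blocking = new_findings(current, baseline)
for finding in blocking:
    print(f"NEW: {finding['rule']} in {finding['path']}")

# A CI job would exit non-zero when this list is not empty.
exit_code = 1 if blocking else 0
```

&lt;p&gt;The legacy finding stays visible in the baseline for later cleanup, but only the newly introduced one blocks the merge request.&lt;/p&gt;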

&lt;h2&gt;
  
  
  DAST — Dynamic Application Security Testing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; attacks a &lt;em&gt;running&lt;/em&gt; instance of your app the way an external scanner would — probing for XSS, injection, misconfigured headers, and exposed endpoints.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; later in the pipeline, after you've deployed the build to an ephemeral staging environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://www.zaproxy.org" rel="noopener noreferrer"&gt;OWASP ZAP&lt;/a&gt; is the open-source standard. Its baseline scan runs cleanly as a Docker container in a CI job — point it at your staging URL and it produces a report you can fail the pipeline on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; DAST is slower and noisier than SAST, so don't gate every merge request on a full active scan. Run a quick ZAP baseline scan on merge requests and a deeper scan nightly against staging.&lt;/p&gt;

&lt;h2&gt;
  
  
  SCA / dependency scanning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; Software Composition Analysis inventories your third-party dependencies and flags ones with known CVEs. Most of the code in your app isn't yours, and this is where most exploitable vulnerabilities actually live.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; on every build, and continuously in the background as new CVEs are published against dependencies you already shipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://trivy.dev" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt; and &lt;a href="https://github.com/anchore/grype" rel="noopener noreferrer"&gt;Grype&lt;/a&gt; both scan lockfiles and dependency manifests for vulnerabilities, open-source and fast. For &lt;em&gt;fixing&lt;/em&gt; — not just finding — &lt;a href="https://docs.github.com/en/code-security/dependabot" rel="noopener noreferrer"&gt;Dependabot&lt;/a&gt; (GitHub-native) and &lt;a href="https://docs.renovatebot.com" rel="noopener noreferrer"&gt;Renovate&lt;/a&gt; (works everywhere, including GitLab) open automated PRs that bump vulnerable packages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; pair scanning with automated updates. Trivy tells you you're exposed; Renovate is what closes the gap before it rots into a quarter-long upgrade project. A scanner with no remediation path just generates anxiety.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container image scanning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans the layers of a built Docker image — OS packages and language dependencies baked into it — for known vulnerabilities. The friendly &lt;code&gt;node:18&lt;/code&gt; base image you pulled six months ago has accumulated CVEs since.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; right after you build the image, before you push it to your registry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://trivy.dev" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt; again (it does both SCA and image scanning, which is why it's so widely deployed), &lt;a href="https://github.com/anchore/grype" rel="noopener noreferrer"&gt;Grype&lt;/a&gt;, and &lt;a href="https://github.com/quay/clair" rel="noopener noreferrer"&gt;Clair&lt;/a&gt;, which underpins several registries' built-in scanning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; scan &lt;em&gt;before&lt;/em&gt; the push and gate the push on the result. A concrete GitLab CI pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;container_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scan&lt;/span&gt;
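  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasec/trivy:latest&lt;/span&gt;
    &lt;span class="c1"&gt;# The image's default entrypoint is the trivy binary; clear it so the runner can start a shell for the script&lt;/span&gt;
    &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;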
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;trivy image --exit-code 1 --severity CRITICAL,HIGH "$IMAGE_TAG"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--exit-code 1&lt;/code&gt; makes the job, and therefore the pipeline, fail on any finding at the listed severities (here, critical or high), so a vulnerable image never reaches the registry in the first place.&lt;/p&gt;
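
&lt;p&gt;If you want more than a pass/fail signal, Trivy can also write a JSON report (&lt;code&gt;--format json&lt;/code&gt;). Below is a minimal Python sketch of summarizing one by severity. It assumes the report keeps its &lt;code&gt;Results&lt;/code&gt; / &lt;code&gt;Vulnerabilities&lt;/code&gt; / &lt;code&gt;Severity&lt;/code&gt; layout, and the sample report is hand-written with placeholder CVE IDs, not real scanner output.&lt;/p&gt;

```python
import json
from collections import Counter


def severity_counts(report):
    """Count vulnerabilities per severity across all scanned targets."""
    counts = Counter()
    for result in report.get("Results", []):
        # "Vulnerabilities" may be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln["Severity"]] += 1
    return counts


# Hand-written sample in the assumed report shape (placeholder CVE IDs).
sample = json.loads("""
{"Results": [
  {"Target": "app (debian)", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}]},
  {"Target": "package-lock.json", "Vulnerabilities": null}
]}
""")

counts = severity_counts(sample)
should_block = counts["CRITICAL"] + counts["HIGH"] != 0
print(dict(counts), "block push:", should_block)
```

&lt;p&gt;A summary like this is handy for a merge request comment or a dashboard, while the exit code from the scan job stays the actual gate.&lt;/p&gt;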

&lt;h2&gt;
  
  
  Infrastructure as Code scanning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans your Terraform, Ansible, CloudFormation, and Kubernetes manifests for insecure configuration &lt;em&gt;before&lt;/em&gt; it provisions anything — public S3 buckets, security groups open to &lt;code&gt;0.0.0.0/0&lt;/code&gt;, unencrypted volumes, over-permissive IAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; in the plan/validate stage, before &lt;code&gt;terraform apply&lt;/code&gt; or &lt;code&gt;ansible-playbook&lt;/code&gt; runs against real infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://www.checkov.io" rel="noopener noreferrer"&gt;Checkov&lt;/a&gt; is the broadest: it covers Terraform, Ansible, Kubernetes, and more out of the box. &lt;a href="https://github.com/aquasecurity/tfsec" rel="noopener noreferrer"&gt;tfsec&lt;/a&gt; is Terraform-focused and fast, though its maintainers have folded it into Trivy and point new users there. &lt;a href="https://kics.io" rel="noopener noreferrer"&gt;KICS&lt;/a&gt; covers a wide spread of IaC formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; scan the Terraform &lt;em&gt;plan&lt;/em&gt;, not just the source, so you catch issues in the resolved configuration. In a GitHub Actions step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkov on Terraform plan&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;terraform plan -out=tfplan.binary&lt;/span&gt;
    &lt;span class="s"&gt;terraform show -json tfplan.binary &amp;gt; tfplan.json&lt;/span&gt;
    &lt;span class="s"&gt;checkov -f tfplan.json --quiet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches the misconfiguration before it becomes a public bucket someone finds for you.&lt;/p&gt;
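
&lt;p&gt;To make "scan the plan" concrete, here is a rough Python sketch of one check a scanner might run against the plan JSON: flagging security groups with ingress open to &lt;code&gt;0.0.0.0/0&lt;/code&gt;. It assumes the plan JSON exposes a &lt;code&gt;resource_changes&lt;/code&gt; list with the planned values under &lt;code&gt;change.after&lt;/code&gt;; the plan fragment is hand-written for illustration, and real tools such as Checkov check far more than this.&lt;/p&gt;

```python
def open_ingress_resources(plan):
    """List security groups in a plan whose ingress allows 0.0.0.0/0."""
    flagged = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_security_group":
            continue
        after = (change.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                flagged.append(change["address"])
                break
    return flagged


# Hand-written fragment in the assumed plan-JSON shape.
plan = {
    "resource_changes": [
        {"address": "aws_security_group.web", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]}]}}},
        {"address": "aws_security_group.db", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 5432, "to_port": 5432,
              "cidr_blocks": ["10.0.0.0/16"]}]}}},
    ]
}

print(open_ingress_resources(plan))
```

&lt;p&gt;Working from the plan rather than the source means variables and modules are already resolved, which is the point of the tip above.&lt;/p&gt;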

&lt;h2&gt;
  
  
  Secrets detection
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; scans commits, history, and CI config for leaked credentials — AWS keys, tokens, private keys, passwords. This is the highest-leverage category for the effort involved, because a single leaked cloud key can cost you the whole environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; everywhere — as a pre-commit hook on the developer's machine &lt;em&gt;and&lt;/em&gt; as a CI job that scans the full diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://github.com/gitleaks/gitleaks" rel="noopener noreferrer"&gt;Gitleaks&lt;/a&gt; and &lt;a href="https://github.com/trufflesecurity/trufflehog" rel="noopener noreferrer"&gt;TruffleHog&lt;/a&gt; are the open-source workhorses. Run both through the &lt;a href="https://pre-commit.com" rel="noopener noreferrer"&gt;pre-commit&lt;/a&gt; framework so secrets get caught &lt;em&gt;before&lt;/em&gt; they ever hit the remote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; defense in depth. The pre-commit hook stops the honest mistake locally; the CI job catches the developer who skipped the hook with &lt;code&gt;--no-verify&lt;/code&gt;. A combined Trivy-plus-Gitleaks GitLab job that fails the pipeline on a critical finding is one of the cheapest, highest-value things you can add this week.&lt;/p&gt;
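
&lt;p&gt;At its simplest, secrets detection is pattern matching over text. The Python sketch below looks for strings shaped like AWS access key IDs. It is a deliberately simplified illustration, not how Gitleaks or TruffleHog are implemented (they ship many more rules, plus entropy checks and verification), and the function name is an assumption for the example.&lt;/p&gt;

```python
import re

# AWS access key IDs follow a well-known shape: "AKIA" plus 16 characters.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_aws_key_ids(text):
    """Return (line number, match) pairs for strings shaped like AWS key IDs."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in AWS_ACCESS_KEY_ID.finditer(line):
            hits.append((line_number, match.group()))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample_diff = """\
variables:
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_REGION: us-east-1
"""

for line_number, key_id in find_aws_key_ids(sample_diff):
    print(f"line {line_number}: possible AWS access key ID {key_id}")
```

&lt;p&gt;A single regex like this shows why the category is cheap to adopt; the dedicated tools earn their keep by covering hundreds of credential formats and scanning history as well as the current diff.&lt;/p&gt;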

&lt;h2&gt;
  
  
  Policy-as-code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; lets you encode organizational rules as machine-checkable policy — "no container runs as root," "every image must come from our approved registry," "no resource without a cost-center tag" — and enforces them automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; two places. In CI, to validate manifests before merge. And at the Kubernetes admission layer, to block non-compliant workloads at deploy time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://www.openpolicyagent.org" rel="noopener noreferrer"&gt;OPA&lt;/a&gt; with &lt;a href="https://www.conftest.dev" rel="noopener noreferrer"&gt;Conftest&lt;/a&gt; for testing config files in CI (Terraform plans, Kubernetes YAML, Dockerfiles) against Rego policies. &lt;a href="https://kyverno.io" rel="noopener noreferrer"&gt;Kyverno&lt;/a&gt; for Kubernetes admission control, where policies are written as Kubernetes resources rather than a separate language — a gentler on-ramp for teams already fluent in YAML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; policy-as-code is how you turn "we agreed in a meeting that prod images must be signed" into something the cluster &lt;em&gt;enforces&lt;/em&gt;. Start with one or two high-value policies in audit mode, watch what they'd block, then flip to enforce.&lt;/p&gt;
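
&lt;p&gt;As a rough illustration of "start in audit mode," here is a minimal Kyverno policy for the "no container runs as root" rule. The policy and rule names are made up, and the fields follow the &lt;code&gt;kyverno.io/v1&lt;/code&gt; API as I understand it; newer Kyverno releases may prefer a per-rule failure action, so check the docs for your version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-non-root        # hypothetical name
spec:
  validationFailureAction: Audit       # report only; switch to Enforce once the reports look sane
  background: true                     # also evaluate workloads that already exist
  rules:
    - name: containers-must-not-run-as-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers must set securityContext.runAsNonRoot to true."
        pattern:
          spec:
            containers:
              # applies to every container in the Pod
              - securityContext:
                  runAsNonRoot: "true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;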

&lt;p&gt;This is also the layer that defends against artifact tampering. Sign the image at build time with &lt;a href="https://docs.sigstore.dev/cosign/overview/" rel="noopener noreferrer"&gt;cosign&lt;/a&gt;, then gate promotion to your production registry, and cluster admission, on a valid signature. If the image isn't the one you built and signed, it doesn't deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runtime security monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt; watches running workloads for suspicious behavior — a container spawning a shell, an unexpected outbound connection, a write to a sensitive path. This is the backstop for everything your pre-deploy scans missed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it fits:&lt;/strong&gt; in production, continuously, outside the pipeline — but it closes the DevSecOps loop by feeding real attack signals back to the people who build the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative tools:&lt;/strong&gt; &lt;a href="https://falco.org" rel="noopener noreferrer"&gt;Falco&lt;/a&gt; is the open-source standard, using eBPF to observe kernel-level syscalls with minimal overhead and alert on rule violations. The broader eBPF tooling ecosystem (Cilium, Tetragon) extends this into network policy and deep observability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical tip:&lt;/strong&gt; runtime security is your evidence that shift-left isn't perfect — and it never is. Treat a Falco alert as both an incident &lt;em&gt;and&lt;/em&gt; a signal to add a check earlier in the pipeline so the same class of thing gets caught before deploy next time.&lt;/p&gt;
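
&lt;p&gt;For a sense of what a rule looks like, here is a small sketch of a Falco rule for the "container spawning a shell" case. It assumes the &lt;code&gt;spawned_process&lt;/code&gt; and &lt;code&gt;container&lt;/code&gt; macros from Falco's default ruleset are loaded; the rule name and tags are illustrative, and the default ruleset already ships a more thorough version of this check.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Custom rules file (sketch), loaded in addition to Falco's default rules
- rule: Shell spawned in container (example)
  desc: A shell process was started inside a running container
  # spawned_process and container are macros assumed to come from the default ruleset
  condition: spawned_process and container and proc.name in (bash, sh, zsh)
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING
  tags: [container, shell]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;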

&lt;h2&gt;
  
  
  How to choose: team size and infrastructure maturity
&lt;/h2&gt;

&lt;p&gt;You do not need all of the above on day one. Coverage you can't maintain is worse than honest gaps, because it breeds alert fatigue and trains your team to click past warnings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lean startup / small team.&lt;/strong&gt; Pick the three highest-leverage, lowest-friction tools and run them as failing CI jobs: secrets detection (Gitleaks), dependency + image scanning (Trivy), and IaC scanning (Checkov) if you're running Terraform. That's maybe an afternoon of pipeline work and it eliminates the most common ways small teams get owned. Add Dependabot or Renovate so you're patching, not just panicking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mid-size, growing team.&lt;/strong&gt; Layer in SAST in diff mode, DAST against staging, and image signing with cosign. Start introducing policy-as-code in audit mode so you understand your real posture before you enforce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mature org with regulatory pressure.&lt;/strong&gt; Now the full stack earns its keep: enforced policy-as-code at admission, runtime monitoring with Falco, signed-and-verified artifacts end to end, and centralized reporting so security can see across teams. At this scale the integration and dashboards matter as much as the scanners.&lt;/p&gt;

&lt;p&gt;The pattern is consistent: each category, run as a job that &lt;em&gt;fails fast on real risk&lt;/em&gt;, beats a fancy tool that produces a report nobody reads. If you want to go deeper on pipeline patterns, our &lt;a href="https://dev.to/categories/gitlab-cicd/"&gt;GitLab CI/CD prompts&lt;/a&gt; cover a lot of this ground.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI fits — assistive, not authoritative
&lt;/h2&gt;

&lt;p&gt;Modern security tooling generates a &lt;em&gt;lot&lt;/em&gt; of output, and this is where AI genuinely earns its place. It's good at the reading and summarizing that wears humans down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triaging scan output.&lt;/strong&gt; Paste a wall of Trivy or Semgrep findings and ask the model to group them, identify which CVEs are actually reachable in your code path, and rank by real exploitability. A list of 200 "HIGH" findings becomes "these 6 matter, here's why."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explaining a finding.&lt;/strong&gt; "What does this Checkov failure mean and what's the minimal Terraform change to fix it?" turns a cryptic policy ID into an actionable diff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drafting remediation.&lt;/strong&gt; Generate the patched dependency version, the hardened Dockerfile, or the corrected security-group block — as a &lt;em&gt;starting point&lt;/em&gt; you review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reading pipeline logs.&lt;/strong&gt; Summarizing a failed, noisy CI job to find the one line that actually broke the build.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The non-negotiable rule, same as during an incident: &lt;strong&gt;AI summarizes and suggests; a human verifies and applies.&lt;/strong&gt; A model will confidently mislabel a vulnerability's severity or propose a "fix" that doesn't compile. Use it to compress the toil, never to make the final security call. We keep a &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; of these workflows, and the same judgment applies to AI-assisted &lt;a href="https://dev.to/dashboard/code-review/"&gt;infrastructure code review&lt;/a&gt; — assistive acceleration, human sign-off.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;There is no single "best" DevSecOps toolchain. The best one is the one your team will actually use &lt;em&gt;consistently&lt;/em&gt;. A perfect scanner that everyone learns to ignore protects nothing — coverage you don't act on is worthless. So start small: pick the tools that fit your pipeline, wire them in as jobs that fail fast on genuine risk, and earn the right to add the next layer. Secrets detection, dependency and image scanning, IaC checks, and signed artifacts will stop the overwhelming majority of how teams actually get compromised. Get those running and reliable first, keep the human in the loop on the AI-assisted parts, and grow the stack as your infrastructure — and your team's appetite — matures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Security scan output and AI-generated remediations are assistive, not authoritative. Always verify findings and fixes against your own systems before applying them in production.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/best-devsecops-security-tools-cicd-pipeline-protection/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>devsecops</category>
      <category>cicd</category>
      <category>security</category>
    </item>
    <item>
      <title>DevOps Security Best Practices Every Engineering Team Should Follow</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Fri, 26 Jun 2026 14:01:11 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/devops-security-best-practices-every-engineering-team-should-follow-18k5</link>
      <guid>https://dev.to/devopsaitoolkit/devops-security-best-practices-every-engineering-team-should-follow-18k5</guid>
      <description>&lt;p&gt;I've spent 25 years securing Linux boxes, cloud accounts, CI/CD pipelines, and production clusters. The single most consistent lesson across all of it is this: the teams that get breached aren't the ones who lacked a security department. They're the ones who treated security as something a separate department would handle &lt;em&gt;later&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Security is not a phase. It's not a gate at the end of the pipeline, and it's not a quarterly audit. It's a property of how you write infrastructure code, manage secrets, ship containers, and run production every single day. When security lives inside the daily workflow — in the merge request, the pipeline stage, the Terraform plan — it costs almost nothing. When it lives in a separate review at the end, it's expensive, late, and routinely skipped.&lt;/p&gt;

&lt;p&gt;This is the checklist I'd hand a new engineering team. Everything here is defensive: hardening, detection, and recovery. Work through it section by section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DevOps security belongs in the daily workflow
&lt;/h2&gt;

&lt;p&gt;The whole premise of DevOps was to stop throwing work over the wall between dev and ops. Security is the last wall standing in most orgs, and it has to come down the same way: by moving the controls into the tools engineers already use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat every pull/merge request as a security review surface, not just a code review.&lt;/li&gt;
&lt;li&gt;Run security checks as pipeline stages that &lt;em&gt;fail the build&lt;/em&gt;, not as advisory reports nobody reads.&lt;/li&gt;
&lt;li&gt;Make the secure path the easy path — a hardened base image, a vetted Terraform module, a secrets helper — so engineers don't route around it.&lt;/li&gt;
&lt;li&gt;Assign a security owner per service, not a security team for the whole company. Ownership beats oversight.&lt;/li&gt;
&lt;li&gt;Measure mean-time-to-remediate for vulnerabilities the same way you measure deploy frequency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a control only exists in a wiki page, it doesn't exist. If it exists in the pipeline, it's real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Secure access control and least privilege
&lt;/h2&gt;

&lt;p&gt;Most incidents I've cleaned up came down to one over-privileged credential. Least privilege is boring and it's the highest-leverage thing on this list.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default every IAM role, Kubernetes ServiceAccount, and Linux user to &lt;em&gt;zero&lt;/em&gt; permissions, then add only what's needed.&lt;/li&gt;
&lt;li&gt;Replace standing admin access with just-in-time elevation that expires automatically.&lt;/li&gt;
&lt;li&gt;Scope cloud roles to specific resources and actions — no &lt;code&gt;*:*&lt;/code&gt; policies, ever.&lt;/li&gt;
&lt;li&gt;In Kubernetes, use namespaced &lt;code&gt;Role&lt;/code&gt; and &lt;code&gt;RoleBinding&lt;/code&gt; objects rather than &lt;code&gt;ClusterRole&lt;/code&gt; bindings wherever possible.&lt;/li&gt;
&lt;li&gt;Separate human identities from machine identities. Humans get SSO; services get workload identity.&lt;/li&gt;
&lt;li&gt;Audit who can &lt;code&gt;sudo&lt;/code&gt;, who's in the &lt;code&gt;docker&lt;/code&gt; group (that's root-equivalent), and who holds cloud admin — quarterly, in writing.&lt;/li&gt;
&lt;/ul&gt;
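
&lt;p&gt;To make the Kubernetes point concrete, a namespaced grant can be as small as this sketch. The &lt;code&gt;payments&lt;/code&gt; namespace and &lt;code&gt;payments-api&lt;/code&gt; ServiceAccount are hypothetical; the point is that the account can read ConfigMaps in one namespace and nothing else.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: payments                  # hypothetical namespace
rules:
  - apiGroups: [""]                    # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list"]             # read-only, no wildcard verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-api-configmap-reader
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-api                 # hypothetical workload identity
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;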

&lt;h2&gt;
  
  
  SSH key management and MFA
&lt;/h2&gt;

&lt;p&gt;SSH is still how a huge amount of production gets touched, and it's still where credential hygiene quietly rots.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Disable password authentication entirely: &lt;code&gt;PasswordAuthentication no&lt;/code&gt; and &lt;code&gt;PermitRootLogin no&lt;/code&gt; in &lt;code&gt;sshd_config&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use per-user keys, never a shared key passed around in a chat thread.&lt;/li&gt;
&lt;li&gt;Prefer short-lived SSH certificates from a CA over long-lived static keys; rotate the rest on a schedule.&lt;/li&gt;
&lt;li&gt;Put a bastion/jump host in front of production and log every session through it.&lt;/li&gt;
&lt;li&gt;Require MFA on every identity provider, VPN, and cloud console — phishing-resistant (WebAuthn/hardware keys) for anyone with production access.&lt;/li&gt;
&lt;li&gt;Pull keys for departed team members the same day, and audit &lt;code&gt;authorized_keys&lt;/code&gt; files for orphans.&lt;/li&gt;
&lt;/ul&gt;
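
&lt;p&gt;The two &lt;code&gt;sshd_config&lt;/code&gt; settings above are easy to enforce with a short Ansible play. This is a sketch, assuming OpenSSH at &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt; and a service named &lt;code&gt;sshd&lt;/code&gt; (on Debian and Ubuntu it is usually &lt;code&gt;ssh&lt;/code&gt;). Confirm key-based access works before you apply it, or you can lock yourself out.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Harden sshd authentication
  hosts: all
  become: true
  tasks:
    - name: Disable password and root logins
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^#?\s*{{ item.key }}\s'        # replace the existing (possibly commented) line
        line: "{{ item.key }} {{ item.value }}"
        validate: /usr/sbin/sshd -t -f %s       # refuse to write a config sshd cannot parse
      loop:
        - { key: PasswordAuthentication, value: "no" }
        - { key: PermitRootLogin, value: "no" }
      notify: Restart sshd

  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd                              # assumption: use "ssh" on Debian/Ubuntu
        state: restarted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;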

&lt;h2&gt;
  
  
  Secrets management: API keys, passwords, and tokens
&lt;/h2&gt;

&lt;p&gt;The fastest way to leak a secret is to commit it. The second fastest is to print it. Both are entirely preventable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never store secrets in git — not in code, not in &lt;code&gt;.env&lt;/code&gt;, not in a "temporary" YAML file. Add a pre-commit secret scanner (gitleaks or trufflehog) to block it.&lt;/li&gt;
&lt;li&gt;Centralize secrets in a real secrets manager: HashiCorp Vault, a cloud secrets manager, or equivalent.&lt;/li&gt;
&lt;li&gt;For Kubernetes, keep plain &lt;code&gt;Secret&lt;/code&gt; manifests out of git: their values are only base64-encoded, not encrypted. Use Sealed Secrets to commit an encrypted form that only the cluster can decrypt, or an external-secrets operator so the cluster pulls from Vault at runtime.&lt;/li&gt;
&lt;li&gt;Give every secret a rotation policy and an owner. Static credentials that never rotate are time bombs.&lt;/li&gt;
&lt;li&gt;Inject secrets as runtime environment values or mounted files, not baked into container images or Terraform state.&lt;/li&gt;
&lt;li&gt;Scan your git history, not just the current tree. A secret deleted in HEAD is still in the log, so treat it as leaked and rotate it.&lt;/li&gt;
&lt;/ul&gt;
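
&lt;p&gt;The pre-commit scanner from the first bullet takes a few lines of config. A minimal sketch, assuming the pre-commit framework and the hook that the Gitleaks repository publishes; the &lt;code&gt;rev&lt;/code&gt; shown is only an example tag, so pin whichever release you have vetted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4          # example tag only; pin the release you have reviewed
    hooks:
      - id: gitleaks      # scans staged changes and blocks the commit on a finding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each developer runs &lt;code&gt;pre-commit install&lt;/code&gt; once per clone to activate the hook. It can still be skipped locally, which is why the same scan belongs in CI as well.&lt;/p&gt;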

&lt;h2&gt;
  
  
  CI/CD pipeline security
&lt;/h2&gt;

&lt;p&gt;Your pipeline has credentials to everything. That makes it one of the highest-value targets you own, and it's frequently the least hardened.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Protect your main branches: require reviews, status checks, and signed commits before merge.&lt;/li&gt;
&lt;li&gt;Mark CI/CD variables as &lt;strong&gt;protected&lt;/strong&gt; and &lt;strong&gt;masked&lt;/strong&gt; so they're only exposed on protected branches and never echoed to logs.&lt;/li&gt;
&lt;li&gt;In GitLab CI, scope variables to environments and never &lt;code&gt;echo&lt;/code&gt; a secret — masking helps, but the discipline of not printing it is what saves you.&lt;/li&gt;
&lt;li&gt;Replace long-lived cloud keys in CI with short-lived credentials via OIDC. Let the pipeline exchange its identity for a temporary, scoped token instead of holding a static &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pin and review your pipeline dependencies — third-party CI templates and actions run with your pipeline's privileges.&lt;/li&gt;
&lt;li&gt;Require manual approval for production deploys, and make the deploy job itself least-privileged.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A leaked CI variable is a leaked production credential. Treat the pipeline config with the same care you'd treat root.&lt;/p&gt;
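
&lt;p&gt;Here is roughly what the OIDC bullet looks like in GitLab CI with AWS. Treat it as a sketch: the role ARN, audience URL, and token path are placeholders, the job image is assumed to have the AWS CLI installed, and the IAM role is assumed to already trust your GitLab instance as an OIDC provider.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;deploy_production:
  stage: deploy
  environment: production
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual                         # a human approves the production deploy
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.example.com      # placeholder; must match the IAM trust policy
  variables:
    AWS_ROLE_ARN: arn:aws:iam::123456789012:role/gitlab-deploy   # placeholder role
    AWS_WEB_IDENTITY_TOKEN_FILE: /tmp/web_identity_token
  script:
    # Write the short-lived token to a file; it is never printed to the job log
    - printf '%s' "$GITLAB_OIDC_TOKEN" &amp;gt; "$AWS_WEB_IDENTITY_TOKEN_FILE"
    # The AWS CLI exchanges the token for temporary credentials on first use
    - aws sts get-caller-identity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;No static &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; is stored anywhere in this setup; the credentials expire on their own.&lt;/p&gt;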

&lt;h2&gt;
  
  
  Container image scanning
&lt;/h2&gt;

&lt;p&gt;A container is only as trustworthy as the layers underneath it. Most images ship with known CVEs the team never looked at.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scan every image with Trivy (or Grype) as a GitLab pipeline stage &lt;strong&gt;before push&lt;/strong&gt;, and fail the build on high/critical findings:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
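&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;container_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Pin a specific version or digest in real pipelines&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasec/trivy:latest&lt;/span&gt;
      &lt;span class="c1"&gt;# The image's entrypoint is trivy itself; clear it so GitLab can run the script&lt;/span&gt;
      &lt;span class="na"&gt;entrypoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE_TAG"&lt;/span&gt;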
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Start from minimal base images (distroless or slim) to shrink the attack surface.&lt;/li&gt;
&lt;li&gt;Run containers as a non-root user (&lt;code&gt;USER&lt;/code&gt; in the Dockerfile) with a read-only root filesystem where possible.&lt;/li&gt;
&lt;li&gt;Drop all Linux capabilities and add back only what's required.&lt;/li&gt;
&lt;li&gt;Pin base images by digest, not by floating &lt;code&gt;:latest&lt;/code&gt; tags, and rebuild regularly to pick up patches.&lt;/li&gt;
&lt;li&gt;Sign images and verify signatures at admission so only your builds run in your cluster.&lt;/li&gt;
&lt;/ul&gt;
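
&lt;p&gt;On Kubernetes, most of the bullets above map onto a handful of &lt;code&gt;securityContext&lt;/code&gt; fields. The manifest below is a sketch with a hypothetical image and user ID, not a drop-in spec.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    runAsNonRoot: true                 # the kubelet refuses to start a container as UID 0
    runAsUser: 10001                   # hypothetical non-root UID
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      # Pin by digest rather than a floating tag (placeholder digest shown)
      image: registry.example.com/app@sha256:REPLACE_WITH_REAL_DIGEST
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]                # then add back only what the workload needs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;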

&lt;h2&gt;
  
  
  Infrastructure as Code security
&lt;/h2&gt;

&lt;p&gt;IaC is where a one-line mistake becomes a fleet-wide misconfiguration. The good news: it's also where automated policy catches it before it ships.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review Terraform and Ansible changes like application code — every change goes through a merge request with a human reviewer.&lt;/li&gt;
&lt;li&gt;Run static analysis on IaC in the pipeline: Checkov or Trivy (which has absorbed &lt;code&gt;tfsec&lt;/code&gt;) for Terraform, &lt;code&gt;ansible-lint&lt;/code&gt; for Ansible, and &lt;code&gt;kube-linter&lt;/code&gt; for Kubernetes manifests.&lt;/li&gt;
&lt;li&gt;Adopt policy-as-code (OPA/Conftest or Sentinel) so rules like "no public S3 buckets" and "no &lt;code&gt;0.0.0.0/0&lt;/code&gt; on port 22" are enforced automatically, not remembered by reviewers.&lt;/li&gt;
&lt;li&gt;Protect and encrypt Terraform state, because it can contain secrets in plaintext. Use a remote backend with locking and access controls.&lt;/li&gt;
&lt;li&gt;For Ansible, encrypt sensitive variables with Vault and avoid &lt;code&gt;become&lt;/code&gt; where it isn't needed.&lt;/li&gt;
&lt;li&gt;Diff the &lt;code&gt;plan&lt;/code&gt; before every &lt;code&gt;apply&lt;/code&gt; and require approval for changes to security groups, IAM, and networking.&lt;/li&gt;
&lt;/ul&gt;
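
&lt;p&gt;The static-analysis bullet can be wired in as one more pipeline job. A sketch, assuming the public &lt;code&gt;bridgecrew/checkov&lt;/code&gt; image and Terraform files somewhere under the repository root:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;iac_scan:
  stage: test
  image:
    name: bridgecrew/checkov:latest    # pin a version in real use
    entrypoint: [""]                   # clear the image entrypoint so the script runs
  script:
    # Exits non-zero when any check fails, which fails the job
    - checkov --directory . --framework terraform --quiet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;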

&lt;p&gt;If you want a structured second opinion on a risky module, an automated &lt;a href="https://dev.to/dashboard/code-review/"&gt;infrastructure code review&lt;/a&gt; catches the misconfigurations a tired reviewer skims past at the end of the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patch management and vulnerability remediation
&lt;/h2&gt;

&lt;p&gt;Unpatched systems are among the most common root causes of breaches, and the least glamorous to fix. Make it routine so it isn't a decision.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automate OS patching with unattended security updates on Linux, and rebuild container images on a cadence rather than letting them age.&lt;/li&gt;
&lt;li&gt;Track your dependencies with SBOMs so you can answer "are we affected?" the day a CVE drops.&lt;/li&gt;
&lt;li&gt;Subscribe to advisories for your stack and define an SLA: criticals patched in days, highs in a week or two.&lt;/li&gt;
&lt;li&gt;Use Dependabot/Renovate to open dependency-bump PRs automatically and run them through your test suite.&lt;/li&gt;
&lt;li&gt;Keep an inventory of every host, image, and cluster version — you can't patch what you don't know you run.&lt;/li&gt;
&lt;/ul&gt;
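
&lt;p&gt;If your code lives on GitHub, the Dependabot bullet is a small config file. The sketch below covers Docker base images and Terraform providers; the directories are assumptions about repository layout. Renovate fills the same role elsewhere, including on GitLab.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# .github/dependabot.yml (sketch)
version: 2
updates:
  - package-ecosystem: "docker"       # base images referenced in Dockerfiles
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "terraform"    # providers and modules
    directory: "/infra"               # hypothetical path to your Terraform code
    schedule:
      interval: "weekly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;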

&lt;h2&gt;
  
  
  Monitoring, logging, and alerting for security events
&lt;/h2&gt;

&lt;p&gt;You cannot respond to what you can't see. Detection is the difference between a contained incident and a postmortem that starts with "we think they were in for three months."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable Linux &lt;code&gt;auditd&lt;/code&gt; and ship &lt;code&gt;/var/log/auth.log&lt;/code&gt; (&lt;code&gt;/var/log/secure&lt;/code&gt; on RHEL-family systems), sudo events, and SSH activity to centralized, append-only storage.&lt;/li&gt;
&lt;li&gt;Export security metrics to Prometheus and build Grafana dashboards plus alerts for anomalies: failed-login spikes, new sudo grants, unexpected outbound connections, root logins, container escapes.&lt;/li&gt;
&lt;li&gt;Alert on auditd/SSH anomalies in Grafana — a burst of failed SSH from a new ASN, or a successful root login outside business hours, should page someone.&lt;/li&gt;
&lt;li&gt;Turn on cloud audit logging (CloudTrail or equivalent) and alert on IAM policy changes, new access keys, and security-group edits.&lt;/li&gt;
&lt;li&gt;Capture Kubernetes audit logs and alert on &lt;code&gt;exec&lt;/code&gt; into production pods and changes to RBAC.&lt;/li&gt;
&lt;li&gt;Keep logs immutable and retained long enough to investigate a slow-burn intrusion.&lt;/li&gt;
&lt;/ul&gt;
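
&lt;p&gt;A failed-login alert might look like the Prometheus rule below. This is only a sketch: &lt;code&gt;ssh_failed_logins_total&lt;/code&gt; is a hypothetical counter that you would have to export yourself (for example from a log-to-metrics exporter), and the threshold is arbitrary.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: security
    rules:
      - alert: SSHFailedLoginSpike
        # ssh_failed_logins_total is a hypothetical metric, not a built-in
        expr: sum by (instance) (rate(ssh_failed_logins_total[5m])) &amp;gt; 1
        for: 10m                       # must stay above the threshold for 10 minutes
        labels:
          severity: page
        annotations:
          summary: "Failed SSH login spike on {{ $labels.instance }}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;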

&lt;h2&gt;
  
  
  Backup and disaster recovery planning
&lt;/h2&gt;

&lt;p&gt;Ransomware and fat-fingered &lt;code&gt;terraform destroy&lt;/code&gt; have the same fix: backups you've actually tested. Untested backups are just hope.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow 3-2-1: three copies, two media types, one off-site (ideally offline).&lt;/li&gt;
&lt;li&gt;Keep at least one immutable/air-gapped copy that a compromised admin credential can't delete.&lt;/li&gt;
&lt;li&gt;Encrypt backups at rest and control who can read and restore them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test restores on a schedule.&lt;/strong&gt; A backup you've never restored is a guess, not a recovery plan.&lt;/li&gt;
&lt;li&gt;Document RPO and RTO per service and verify your backup cadence actually meets them.&lt;/li&gt;
&lt;li&gt;Back up the things people forget: Terraform state, Vault data, etcd, and database credentials.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Incident response preparation
&lt;/h2&gt;

&lt;p&gt;The middle of an incident is the worst time to figure out your incident process. Prepare while it's calm.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write a runbook: who's on call, how to declare an incident, where to communicate, and how to reach the right people fast.&lt;/li&gt;
&lt;li&gt;Pre-define severity levels and the actions each triggers.&lt;/li&gt;
&lt;li&gt;Keep break-glass credentials in a sealed, audited path — available in a crisis, logged when used.&lt;/li&gt;
&lt;li&gt;Practice. Run a tabletop or game day at least quarterly so the steps are muscle memory.&lt;/li&gt;
&lt;li&gt;Have communication templates ready — customer-facing and internal — so comms don't stall the investigation.&lt;/li&gt;
&lt;li&gt;Draft blameless postmortems while the timeline is fresh, and turn action items into tracked work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AI-assisted security checks — review, never blind trust
&lt;/h2&gt;

&lt;p&gt;AI is genuinely useful for security work: it reads more YAML, Terraform, and logs than you can, and it's fast at spotting the misconfiguration buried in a 400-line diff. The rule is the same one I apply to everything an AI generates — &lt;strong&gt;it drafts, you decide.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI to review IaC and pipeline configs for missing controls, over-broad permissions, and risky defaults — as a first-pass reviewer, not the final word.&lt;/li&gt;
&lt;li&gt;Have it summarize scanner output and rank findings by real-world exploitability so your team fixes what matters first.&lt;/li&gt;
&lt;li&gt;Let it draft hardening changes and policy rules, then read every line before you apply it — confident and correct are not the same thing.&lt;/li&gt;
&lt;li&gt;Never paste live secrets, real hostnames, or customer data into a model. Scrub first.&lt;/li&gt;
&lt;li&gt;Keep a human approving every change that touches IAM, networking, or production. AI accelerates the work; it doesn't own the risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want vetted starting points, our &lt;a href="https://dev.to/categories/security-hardening/"&gt;security &amp;amp; hardening prompts&lt;/a&gt; cover image scanning, Linux hardening, and IaC review, and the broader &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; has the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;None of these practices are exotic. Least privilege, managed secrets, scanned images, policy-checked infrastructure, patched hosts, real monitoring, tested backups, and a rehearsed incident plan — every one is achievable this quarter, and every one is cheaper to build into the workflow than to bolt on after a breach.&lt;/p&gt;

&lt;p&gt;And here's the part that doesn't show up on the engineering scorecard: secure DevOps is a competitive advantage. Customers run security questionnaires before they sign. Investors ask about your posture in diligence. Partners won't integrate with a platform they don't trust. The companies that win those deals are the ones who can show, not claim, that protecting customer systems is built into how they work every day.&lt;/p&gt;

&lt;p&gt;Security isn't the tax you pay to ship. It's part of why people let you ship to them at all.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/devops-security-best-practices-engineering-teams/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>security</category>
      <category>devsecops</category>
      <category>hardening</category>
    </item>
    <item>
      <title>Why AI Loves Ansible (And You Should Let It Help)</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 25 Jun 2026 04:36:48 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/why-ai-loves-ansible-and-you-should-let-it-help-3o2p</link>
      <guid>https://dev.to/devopsaitoolkit/why-ai-loves-ansible-and-you-should-let-it-help-3o2p</guid>
      <description>&lt;p&gt;If you compare how well Claude handles Ansible against how well it handles, say, raw bash or kubectl YAML, Ansible wins by a wide margin. The reason isn't subtle: Ansible's shape — declarative, idempotent, modules-with-arguments — happens to map almost perfectly to how LLMs reason. They're good at producing structured output that fills in a known template, and that's what most Ansible tasks are.&lt;/p&gt;

&lt;p&gt;This means AI-assisted Ansible work is the highest-leverage automation pairing I know of. If you only adopt AI for one infrastructure tool, make it Ansible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Ansible AI-friendly
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Modules have published contracts
&lt;/h3&gt;

&lt;p&gt;Every Ansible module has a documented argument spec: what's required, what's optional, what the defaults are. The model can fit your intent into the spec with high accuracy because the spec is finite and known.&lt;/p&gt;

&lt;p&gt;Compare this to shell: there are a thousand ways to "create a user with a specific UID, member of these groups, with this shell, and a home directory in this location." In bash, every distro is slightly different. In Ansible, you use &lt;code&gt;ansible.builtin.user&lt;/code&gt; with named arguments.&lt;/p&gt;

&lt;p&gt;The model gets this right far more reliably than it does the equivalent shell, because there is only one spec to match.&lt;/p&gt;
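
&lt;p&gt;For illustration, the user described above comes out as a single task. The name, UID, groups, and paths here are made up, and the groups are assumed to exist already.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Create the deploy user
  ansible.builtin.user:
    name: deploy              # hypothetical user
    uid: 1501
    groups: [adm, www-data]   # supplementary groups, assumed to exist
    append: true              # keep any groups the user already has
    shell: /bin/bash
    home: /srv/deploy
    create_home: true
    state: present
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;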

&lt;h3&gt;
  
  
  Idempotency is the default
&lt;/h3&gt;

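&lt;p&gt;When a model generates a Python script, it has to think about "what if this is run twice." When it generates Ansible, most modules handle that for free (the exceptions are &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;shell&lt;/code&gt;, which run every time unless you guard them with &lt;code&gt;creates&lt;/code&gt;, &lt;code&gt;removes&lt;/code&gt;, or &lt;code&gt;changed_when&lt;/code&gt;). The model can write the task, leave the re-run case to the module, and produce something that works.&lt;/p&gt;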

&lt;p&gt;This means the cognitive load on both sides — model and human — is lower. You're describing the target state, not the procedure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Roles and structure are predictable
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;roles/foo/{defaults,vars,tasks,handlers,meta}/main.yml&lt;/code&gt;, plus &lt;code&gt;templates/&lt;/code&gt; and &lt;code&gt;files/&lt;/code&gt; directories: every Ansible role looks the same. The model can scaffold a new role in seconds because the layout is fixed.&lt;/p&gt;

&lt;p&gt;If you ask Claude to "create a new role for installing PostgreSQL 16 on Ubuntu 24.04 with default user &lt;code&gt;postgres&lt;/code&gt; and a tuned &lt;code&gt;postgresql.conf&lt;/code&gt;," you'll get a complete role structure with &lt;code&gt;defaults/main.yml&lt;/code&gt;, &lt;code&gt;tasks/main.yml&lt;/code&gt;, a Jinja template, and &lt;code&gt;handlers/main.yml&lt;/code&gt; — all consistent, all in the right places. The structure is constrained enough that the model rarely improvises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases where AI shines for Ansible
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Generating new roles from scratch
&lt;/h3&gt;

&lt;p&gt;This is the killer app. You can describe a role in two sentences and get a 90%-done implementation. You then refine: add validation, adjust defaults, write a README.&lt;/p&gt;

&lt;p&gt;I now treat "draft a new role with Claude" as the default first step. Even if I rewrite half of it, the structure saves me 20 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Converting shell scripts to playbooks
&lt;/h3&gt;

&lt;p&gt;If you have a legacy bash script that provisions a server, pasting it into Claude with "convert this to an idempotent Ansible playbook using the appropriate modules" produces a usable result. The model knows when to use &lt;code&gt;ansible.builtin.file&lt;/code&gt;, &lt;code&gt;lineinfile&lt;/code&gt;, &lt;code&gt;template&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;You'll need to verify the idempotency manually (run twice, expect 0 changes on the second run), but the conversion is mostly mechanical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Refactoring playbooks to use FQCN
&lt;/h3&gt;

&lt;p&gt;Ansible 2.10+ wants fully-qualified collection names: &lt;code&gt;ansible.builtin.package&lt;/code&gt; instead of &lt;code&gt;package&lt;/code&gt;. Old playbooks have hundreds of short-form references. AI is a perfect fit for this kind of mass refactoring — it knows the mapping and won't get bored.&lt;/p&gt;

&lt;p&gt;Paste a 200-line playbook, ask for it back with FQCN throughout, and you're done in 30 seconds. Verify with &lt;code&gt;ansible-lint&lt;/code&gt;.&lt;/p&gt;
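
&lt;p&gt;The change itself is mechanical. A before-and-after for a single task, shown together here only for comparison:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Before: short module name
- name: Install nginx
  package:
    name: nginx
    state: present

# After: fully-qualified collection name, same arguments
- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;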

&lt;h3&gt;
  
  
  Writing Molecule tests
&lt;/h3&gt;

&lt;p&gt;Molecule scaffolding is repetitive — same &lt;code&gt;molecule.yml&lt;/code&gt;, same &lt;code&gt;converge.yml&lt;/code&gt;, same &lt;code&gt;verify.yml&lt;/code&gt; structure for most roles. AI is great at generating the boilerplate. You describe what you want to test; the model writes the assertion playbook.&lt;/p&gt;
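
&lt;p&gt;A typical assertion playbook is short. The sketch below assumes a role that installs nginx on a systemd-based test instance; the service name and port are placeholders for whatever your role manages.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# molecule/default/verify.yml (sketch)
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Collect service state
      ansible.builtin.service_facts:

    - name: Assert nginx is running
      ansible.builtin.assert:
        that:
          - ansible_facts.services['nginx.service'].state == 'running'

    - name: Assert port 80 accepts connections
      ansible.builtin.wait_for:
        port: 80
        timeout: 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;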

&lt;h3&gt;
  
  
  Jinja template generation
&lt;/h3&gt;

&lt;p&gt;Jinja is just structured-enough that AI handles it well — generating templates for config files (nginx, postgres, sshd) from a description of the desired behavior. The model knows the configuration keys and the conditional structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI struggles with Ansible
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Variable precedence
&lt;/h3&gt;

&lt;p&gt;Ansible's variable precedence rules (22 levels in the official documentation) are not intuitive. The model will sometimes suggest putting a variable in &lt;code&gt;vars/main.yml&lt;/code&gt; when you really want it in &lt;code&gt;defaults/main.yml&lt;/code&gt; (the former overrides the latter, and also overrides inventory and play variables). The result: users of your role can't override the variable they expected to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When the model puts something in &lt;code&gt;vars/&lt;/code&gt;, ask "should this be overridable by the role user?" If yes, move to &lt;code&gt;defaults/&lt;/code&gt;.&lt;/p&gt;
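
&lt;p&gt;The split usually ends up looking something like this sketch. The role and variable names are hypothetical, and the two files are shown in one block only for comparison.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# roles/postgres/defaults/main.yml
# Lowest precedence: anything the role user might reasonably tune
postgres_version: 16
postgres_max_connections: 100

# roles/postgres/vars/main.yml
# Higher precedence: inventory group_vars and host_vars will NOT override these,
# so keep only internal values here
postgres_service_name: postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;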

&lt;h3&gt;
  
  
  Custom facts and &lt;code&gt;set_fact&lt;/code&gt; lifetime
&lt;/h3&gt;

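&lt;p&gt;A &lt;code&gt;set_fact&lt;/code&gt; value lives for the rest of the current playbook run, so later plays in the same run can see it for that host. What it does not do is survive into the next run: that requires &lt;code&gt;cacheable: true&lt;/code&gt; plus a configured fact cache. The model sometimes uses &lt;code&gt;set_fact&lt;/code&gt; for values that need to persist across runs without adding either, and the next run sees &lt;code&gt;undefined&lt;/code&gt;.&lt;/p&gt;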

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When you use &lt;code&gt;set_fact&lt;/code&gt; for a value you need later, verify the lifetime is what you expect.&lt;/p&gt;
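
&lt;p&gt;A quick way to test the lifetime is a two-play playbook like the sketch below (the host group and value are hypothetical). As far as I know, a &lt;code&gt;set_fact&lt;/code&gt; value is visible to later plays for the same host within one run, and &lt;code&gt;cacheable: true&lt;/code&gt; only matters across runs when a fact cache is configured; it is worth confirming on your own Ansible version.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: First play
  hosts: web
  gather_facts: false
  tasks:
    - name: Record the release id
      ansible.builtin.set_fact:
        release_id: "2026-06-25.1"   # hypothetical value
        cacheable: true              # also stored in the fact cache, if one is configured

- name: Second play in the same run
  hosts: web
  gather_facts: false
  tasks:
    - name: Show what survived from the first play
      ansible.builtin.debug:
        var: release_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;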

&lt;h3&gt;
  
  
  Vault integration
&lt;/h3&gt;

&lt;p&gt;The model will sometimes generate playbooks that reference &lt;code&gt;vault_db_password&lt;/code&gt; as a variable but don't include the &lt;code&gt;lookup('community.hashi_vault.hashi_vault', ...)&lt;/code&gt; call or the Ansible Vault encrypted file. You have to wire up the secret source separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; For any sensitive variable in a generated playbook, verify there's an actual source for it (Vault encrypted file, external manager lookup, environment variable).&lt;/p&gt;
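
&lt;p&gt;Wiring up a real source can be this small. The sketch assumes the &lt;code&gt;community.hashi_vault&lt;/code&gt; collection is installed, that &lt;code&gt;VAULT_ADDR&lt;/code&gt; and &lt;code&gt;VAULT_TOKEN&lt;/code&gt; are set in the controller's environment, and that the secret path, database user, and hash are placeholders for your own.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Configure the database user
  hosts: db
  vars:
    # Hypothetical KV v2 path; resolved on the controller when the variable is used
    vault_db_password: "{{ lookup('community.hashi_vault.hashi_vault', 'secret=secret/data/app/db:password') }}"
  tasks:
    - name: Set the application user's password
      ansible.builtin.user:
        name: appdb                                            # hypothetical user
        password: "{{ vault_db_password | password_hash('sha512') }}"
      no_log: true                                             # keep the value out of task output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;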

&lt;h3&gt;
  
  
  Distro-specific paths
&lt;/h3&gt;

&lt;p&gt;The model defaults to Debian/Ubuntu conventions. If you run on RHEL, you'll sometimes get &lt;code&gt;apt&lt;/code&gt; modules in tasks that should be using the &lt;code&gt;package&lt;/code&gt; module (or distro conditionals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; When generating playbooks for non-Debian systems, audit for &lt;code&gt;apt&lt;/code&gt;, &lt;code&gt;apt_repository&lt;/code&gt;, &lt;code&gt;dpkg_selections&lt;/code&gt;, and ask for the abstraction (&lt;code&gt;package&lt;/code&gt;) or the distro split.&lt;/p&gt;
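
&lt;p&gt;Roughly what the two options look like, assuming a package that has the same name on both distro families (&lt;code&gt;nginx&lt;/code&gt; is used only as an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Option 1: the distro-agnostic abstraction.
- name: Install nginx with whatever package manager the host uses
  ansible.builtin.package:
    name: nginx
    state: present

# Option 2: the distro split, for steps that only exist on one family.
- name: Refresh the apt cache (Debian family only)
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600
  when: ansible_facts['os_family'] == "Debian"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;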

&lt;h2&gt;
  
  
  A workflow that's been working for me
&lt;/h2&gt;

&lt;p&gt;For a new role, my process now looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Describe the role&lt;/strong&gt; to Claude in 2-3 sentences (purpose, target distros, key behaviors).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate the scaffolding&lt;/strong&gt;: &lt;code&gt;defaults/main.yml&lt;/code&gt;, &lt;code&gt;tasks/main.yml&lt;/code&gt;, a template if needed, &lt;code&gt;meta/main.yml&lt;/code&gt; with platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every task.&lt;/strong&gt; Look for the failure modes above (precedence, lifetime, Vault, distros).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Molecule tests.&lt;/strong&gt; Have Claude scaffold &lt;code&gt;molecule/default/&lt;/code&gt;, then write the assertions yourself or ask for them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run &lt;code&gt;ansible-lint&lt;/code&gt; and Molecule.&lt;/strong&gt; Fix what they catch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotence check.&lt;/strong&gt; Run the role twice; the second run should report &lt;code&gt;changed=0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refine the README.&lt;/strong&gt; This is the one place I write from scratch — explaining the role to future-me.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This takes maybe 30 minutes for a moderately complex role. Without AI assistance, the same role would take me a couple of hours.&lt;/p&gt;
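
&lt;p&gt;For step 4, the assertions can start very small. A sketch of a Molecule verify playbook, assuming the default Ansible verifier and a hypothetical config path that the role is expected to render:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# molecule/default/verify.yml
# /etc/webapp/config.yml is a placeholder for whatever your role writes.
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Look for the rendered config file
      ansible.builtin.stat:
        path: /etc/webapp/config.yml
      register: config_file

    - name: Assert the config file is present
      ansible.builtin.assert:
        that:
          - config_file.stat.exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;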

&lt;h2&gt;
  
  
  A note on safety
&lt;/h2&gt;

&lt;p&gt;Ansible runs as root on production servers. Whatever the model generates, &lt;em&gt;you&lt;/em&gt; are responsible for what it does. Two patterns I follow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run with &lt;code&gt;--check --diff&lt;/code&gt; before any real run.&lt;/strong&gt; Dry-run the playbook in check mode; verify the diff matches what you expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test on a sandbox host first.&lt;/strong&gt; Especially for new roles. Don't trust the model with production until the role has run cleanly on a throwaway VM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the same disciplines that apply to any infrastructure change. AI doesn't change the discipline; it just makes you faster at the parts before the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I think Ansible is the right entry point
&lt;/h2&gt;

&lt;p&gt;If you're new to using AI for infrastructure work and want to pick one tool to start with, Ansible is the safest, highest-leverage choice. The structure makes the AI accurate. The idempotency makes mistakes recoverable. The module ecosystem covers most common cases.&lt;/p&gt;

&lt;p&gt;By the time you've used AI to write a dozen Ansible playbooks, you'll have developed the intuition for what AI handles well and what needs human attention. That intuition transfers to harder tools — Terraform, Kubernetes, custom shell — where the cost of AI mistakes is higher.&lt;/p&gt;

&lt;p&gt;For our full set of AI-driven Ansible workflows, see the &lt;a href="https://dev.to/categories/iac/"&gt;IaC category&lt;/a&gt; — including &lt;a href="https://dev.to/prompts/ansible-vault-secrets-management/"&gt;ansible-vault-secrets-management&lt;/a&gt; and &lt;a href="https://dev.to/prompts/ansible-molecule-testing/"&gt;ansible-molecule-testing&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/why-ai-loves-ansible/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ansible</category>
      <category>ai</category>
      <category>automation</category>
    </item>
    <item>
      <title>AI for GitLab CI Authoring: Save Hours, Avoid Footguns</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sat, 20 Jun 2026 12:15:22 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/ai-for-gitlab-ci-authoring-save-hours-avoid-footguns-3lco</link>
      <guid>https://dev.to/devopsaitoolkit/ai-for-gitlab-ci-authoring-save-hours-avoid-footguns-3lco</guid>
      <description>&lt;p&gt;GitLab CI YAML is one of those formats where you can stare at it for an hour, get it 95% right, and have it fail with &lt;code&gt;yaml: line 12: did not find expected key&lt;/code&gt; because of a tab character. AI assistants are very fast at this kind of work. They're also confidently wrong about specific GitLab features in ways that waste a lot of time if you don't know what to check.&lt;/p&gt;

&lt;p&gt;After a year of letting Claude write a lot of my pipelines, here's what works and what doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI gets right consistently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Standard job shapes
&lt;/h3&gt;

&lt;p&gt;"Write me a job that builds a Docker image, pushes to the GitLab Container Registry, and tags with the commit SHA and &lt;code&gt;latest&lt;/code&gt; on the default branch." Type that into Claude and you get a working job in five seconds. The shape is well-established and the model has seen thousands of variations.&lt;/p&gt;

&lt;p&gt;The same is true for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test jobs across languages (pytest, jest, go test, etc.)&lt;/li&gt;
&lt;li&gt;Standard cache configurations&lt;/li&gt;
&lt;li&gt;Standard artifact patterns&lt;/li&gt;
&lt;li&gt;Basic &lt;code&gt;rules:&lt;/code&gt; for branch / tag / MR pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find yourself writing one of these from scratch, you're spending time that you don't need to spend.&lt;/p&gt;

&lt;h3&gt;
  
  
  Translating from other CIs
&lt;/h3&gt;

&lt;p&gt;GitLab CI has obvious parallels to GitHub Actions, CircleCI, Jenkins declarative pipelines, etc. AI is &lt;em&gt;excellent&lt;/em&gt; at translating between them. The structures rhyme; the model knows the dictionary.&lt;/p&gt;

&lt;p&gt;If you're migrating from Actions to GitLab CI, paste the workflow and ask for the GitLab CI equivalent. You'll get something 80% right that you can refine.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reviewing pipelines for inefficiency
&lt;/h3&gt;

&lt;p&gt;This is the underrated use case. Paste your &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; and ask: "what's the critical path of this pipeline, and what's making it slow?" The model will spot things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Your test job downloads node_modules from cache, but install-deps doesn't push to cache — your cache key is broken."&lt;/li&gt;
&lt;li&gt;"Your build and deploy stages are sequential but build's artifacts aren't used by deploy — they can be parallel with &lt;code&gt;needs:&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;"Your &lt;code&gt;rules:changes:&lt;/code&gt; doesn't include &lt;code&gt;package-lock.json&lt;/code&gt;, so dependency changes don't retrigger tests."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are real findings I've gotten from Claude on pipelines I thought I'd already optimized. Worth the five-minute review.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI gets wrong — and how to catch it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;rules:&lt;/code&gt; vs &lt;code&gt;only/except&lt;/code&gt; confusion
&lt;/h3&gt;

&lt;p&gt;The model will sometimes mix them in the same job. GitLab rejects that configuration: validation fails with a &lt;code&gt;key may not be used with `rules`&lt;/code&gt; error and the pipeline is never created. The message points at the keyword, not at the intent, so it is easy to "fix" by deleting the wrong half.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Are you using &lt;code&gt;rules:&lt;/code&gt; OR &lt;code&gt;only:&lt;/code&gt;/&lt;code&gt;except:&lt;/code&gt; in each job? Pick one. (Use &lt;code&gt;rules:&lt;/code&gt; — &lt;code&gt;only/except&lt;/code&gt; is legacy.)&lt;/p&gt;
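
&lt;p&gt;As a rough illustration, here is a job that uses &lt;code&gt;rules:&lt;/code&gt; only. The job name and script are placeholders; &lt;code&gt;$CI_DEFAULT_BRANCH&lt;/code&gt; is the predefined variable for the project's default branch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;deploy:
  stage: deploy
  # Placeholder script.
  script:
    - ./deploy.sh
  # One mechanism per job: rules, with no only/except next to it.
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;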

&lt;h3&gt;
  
  
  &lt;code&gt;$CI_COMMIT_BRANCH&lt;/code&gt; empty on MR pipelines
&lt;/h3&gt;

&lt;p&gt;A common bug: you ask for "this job runs on the default branch" and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_BRANCH == "main"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is correct for branch pipelines. In MR (&lt;code&gt;merge_request_event&lt;/code&gt;) pipelines, however, &lt;code&gt;$CI_COMMIT_BRANCH&lt;/code&gt; is &lt;strong&gt;empty&lt;/strong&gt;, so the comparison never matches. If you have MR pipelines enabled, your job silently won't run when developers expect it to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Does your pipeline target both push events and MR events? If so, you probably want &lt;code&gt;$CI_MERGE_REQUEST_TARGET_BRANCH_NAME&lt;/code&gt; or to handle both pipeline sources.&lt;/p&gt;
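
&lt;p&gt;A sketch of rules that handle both pipeline sources, assuming the job should run for the default branch and for MRs that target it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
  # MR pipelines: $CI_COMMIT_BRANCH is empty here, so match on the target branch.
  - if: $CI_PIPELINE_SOURCE == "merge_request_event" &amp;amp;&amp;amp; $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == $CI_DEFAULT_BRANCH
  # Branch pipelines on the default branch.
  - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If both branch and MR pipelines are enabled for the same push, you may also need &lt;code&gt;workflow:rules&lt;/code&gt; to avoid duplicate pipelines.&lt;/p&gt;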

&lt;h3&gt;
  
  
  &lt;code&gt;needs:&lt;/code&gt; referencing hidden jobs
&lt;/h3&gt;

&lt;p&gt;Hidden jobs (prefixed with &lt;code&gt;.&lt;/code&gt;) are templates — they don't execute. If you do &lt;code&gt;needs: [".lint"]&lt;/code&gt;, your job will fail with a confusing error because GitLab thinks you're depending on a job that doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; Every &lt;code&gt;needs:&lt;/code&gt; entry should be a real job name, not a template.&lt;/p&gt;
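
&lt;p&gt;The usual shape, sketched with placeholder job names and scripts: a concrete job &lt;code&gt;extends&lt;/code&gt; the template, and &lt;code&gt;needs:&lt;/code&gt; points at the concrete job.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hidden job: a template, never executed on its own.
.lint:
  stage: test
  script:
    - ./run-lint.sh

# Real job that inherits from the template.
lint:
  extends: .lint

publish:
  stage: deploy
  script:
    - ./publish.sh
  # Reference the real job, not ".lint".
  needs: ["lint"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;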

&lt;h3&gt;
  
  
  Auto-apply rules that don't include the right branches
&lt;/h3&gt;

&lt;p&gt;The model loves writing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_COMMIT_BRANCH == "main"&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;never&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



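&lt;p&gt;This works for branch pipelines on &lt;code&gt;main&lt;/code&gt; (including schedules that target &lt;code&gt;main&lt;/code&gt;) but blocks the job on tag pipelines, on MR pipelines, and on every other branch. Sometimes that's what you want. Often it's not.&lt;/p&gt;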

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; What pipeline sources do you expect this job to run in? List them, then verify your rules cover each.&lt;/p&gt;
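
&lt;p&gt;Writing the list down as one rule per source makes gaps obvious. A sketch, assuming you want tags, schedules, MRs, and the default branch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
  # Tag pipelines.
  - if: $CI_COMMIT_TAG
  # Scheduled pipelines, whichever branch they target.
  - if: $CI_PIPELINE_SOURCE == "schedule"
  # Merge request pipelines.
  - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  # Branch pipelines on the default branch.
  - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  # No match means the job is left out of the pipeline.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;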

&lt;h3&gt;
  
  
  Imaginary GitLab features
&lt;/h3&gt;

&lt;p&gt;This is the most expensive AI failure mode. The model will sometimes generate syntax for features that don't exist, or describe real keywords incorrectly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;code&gt;condition:&lt;/code&gt; field that's actually OPA/Conftest, not GitLab CI&lt;/li&gt;
&lt;li&gt;An &lt;code&gt;auto_retry:&lt;/code&gt; block that's GitHub Actions, not GitLab&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;before_script:&lt;/code&gt; keyword that does exist but with different semantics than the model claims&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Check:&lt;/strong&gt; If you see a keyword you haven't seen before in GitLab docs, verify it exists. The CI lint API (&lt;code&gt;POST /api/v4/projects/:id/ci/lint&lt;/code&gt;; the older instance-level &lt;code&gt;/api/v4/ci/lint&lt;/code&gt; was removed in GitLab 16.0) catches most of these, but some pass lint and just behave weirdly.&lt;/p&gt;
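
&lt;p&gt;For comparison, the keyword GitLab actually provides for retries is &lt;code&gt;retry:&lt;/code&gt;. A short sketch with a placeholder job name and script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;flaky-tests:
  stage: test
  # Placeholder script.
  script:
    - ./run-tests.sh
  retry:
    # Retry up to two more times, but only for infrastructure-style failures.
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;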

&lt;h2&gt;
  
  
  A workflow that catches the failures cheaply
&lt;/h2&gt;

&lt;p&gt;I now do this for any non-trivial pipeline change:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Draft with AI.&lt;/strong&gt; Describe the desired behavior in plain English; let the model write the YAML.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read every line.&lt;/strong&gt; Treat the output as a draft you'd write yourself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lint via the API.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   &lt;span class="c"&gt;# PROJECT_ID: numeric ID (or URL-encoded path) of the project to lint against&lt;/span&gt;
   curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"PRIVATE-TOKEN: &lt;/span&gt;&lt;span class="nv"&gt;$TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;content&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; .gitlab-ci.yml | jq &lt;span class="nt"&gt;-Rs&lt;/span&gt; .&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
       &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GITLAB_URL&lt;/span&gt;&lt;span class="s2"&gt;/api/v4/projects/&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ID&lt;/span&gt;&lt;span class="s2"&gt;/ci/lint"&lt;/span&gt; | jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Run on a sandbox branch.&lt;/strong&gt; Push to a branch that won't trigger deploys; verify the pipeline runs the jobs you expect, when you expect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff against the existing pipeline.&lt;/strong&gt; If the AI introduced changes you didn't ask for (a different cache key, a removed &lt;code&gt;interruptible:&lt;/code&gt;), revert them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Step 5 is the one most people skip. The model is good at writing YAML but not at preserving your previous decisions. If you don't diff, you'll lose your old cache strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical example
&lt;/h2&gt;

&lt;p&gt;Last month I needed to add a job that runs &lt;code&gt;terraform plan&lt;/code&gt; on every MR and posts the output as a comment. Drafted with Claude in two minutes; it produced something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;terraform-plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hashicorp/terraform:1.9&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;plan&lt;/span&gt;
  &lt;span class="na"&gt;script&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform init&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform plan -out=tfplan -no-color&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;terraform show -no-color tfplan &amp;gt; plan.txt&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;curl -X POST -H "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \&lt;/span&gt;
          &lt;span class="s"&gt;-d "body=$(cat plan.txt | jq -Rs .)" \&lt;/span&gt;
          &lt;span class="s"&gt;"$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$CI_PIPELINE_SOURCE == "merge_request_event"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;em&gt;almost&lt;/em&gt; right. Two issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
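&lt;strong&gt;&lt;code&gt;PRIVATE-TOKEN&lt;/code&gt; as a CI variable&lt;/strong&gt;: using a personal access token for CI is the old pattern, because it is tied to one user and has to be rotated by hand. Prefer &lt;code&gt;$CI_JOB_TOKEN&lt;/code&gt; wherever the endpoint accepts it. The job token only works on a limited set of API endpoints, so confirm the notes endpoint is covered on your GitLab version; if it isn't, use a project access token with the narrowest scope that works.&lt;/li&gt;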
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;terraform init -backend-config&lt;/code&gt;&lt;/strong&gt; — works if the backend is configured in code, but if you have multiple environments using the same module, you'd want to specify which backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both fixes are 30 seconds. Without the AI I'd have spent 15 minutes writing the curl invocation alone.&lt;/p&gt;
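
&lt;p&gt;A sketch of the second fix only, with the comment-posting step left out. The &lt;code&gt;TF_ENV&lt;/code&gt; variable and the &lt;code&gt;envs/*.backend.hcl&lt;/code&gt; layout are assumptions for illustration, and the entrypoint override assumes the image starts the &lt;code&gt;terraform&lt;/code&gt; binary by default:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;terraform-plan:
  image:
    name: hashicorp/terraform:1.9
    # Assumption: clear the image entrypoint so the runner can start a shell.
    entrypoint: [""]
  stage: plan
  variables:
    # Hypothetical environment selector.
    TF_ENV: staging
  script:
    # Hypothetical file layout: one backend config file per environment.
    - terraform init -input=false -backend-config="envs/${TF_ENV}.backend.hcl"
    - terraform plan -input=false -out=tfplan -no-color
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;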

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;AI doesn't replace knowing GitLab CI. It removes the typing and the boilerplate so you can spend your attention on the parts that matter — the &lt;code&gt;rules:&lt;/code&gt; logic, the cache keys, the secrets, the environment promotion.&lt;/p&gt;

&lt;p&gt;Once you've internalized the failure modes above, the workflow becomes mostly automatic. You stop reading the boilerplate and start reading the rules. That's where the bugs live.&lt;/p&gt;

&lt;p&gt;For the prompt set we use on GitLab CI specifically, see the &lt;a href="https://dev.to/categories/gitlab-cicd/"&gt;GitLab CI/CD category&lt;/a&gt; — particularly &lt;a href="https://dev.to/prompts/gitlab-pipeline-optimization/"&gt;gitlab-pipeline-optimization&lt;/a&gt; and &lt;a href="https://dev.to/prompts/gitlab-ci-rules-debugging/"&gt;gitlab-ci-rules-debugging&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/ai-for-gitlab-ci-authoring/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gitlab</category>
      <category>cicd</category>
      <category>ai</category>
    </item>
    <item>
      <title>Securing AI-Generated Bash Scripts Before You Run Them</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 18 Jun 2026 15:51:56 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/securing-ai-generated-bash-scripts-before-you-run-them-401m</link>
      <guid>https://dev.to/devopsaitoolkit/securing-ai-generated-bash-scripts-before-you-run-them-401m</guid>
      <description>&lt;p&gt;Bash is the easiest language for AI to write and the easiest language to get devastating output from. A 20-line script that "just cleans up old files" can recursively delete a home directory because the model assumed a variable would always be set. A "simple log shipper" can write your secrets to a remote server because the model used &lt;code&gt;set -x&lt;/code&gt; for debugging and forgot to remove it.&lt;/p&gt;

&lt;p&gt;I have run AI-generated bash that I should not have. Most engineers I know have too. After enough close calls, there's a short checklist that catches the worst of it. This is that checklist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five things to check before running any AI-generated bash
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Does it start with a strict pragma?
&lt;/h3&gt;

&lt;p&gt;The first lines of any non-trivial bash script should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What each does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -e&lt;/code&gt;&lt;/strong&gt; — exit on any command failure. Without this, a failure in line 5 doesn't stop the script from happily running lines 6-50.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -u&lt;/code&gt;&lt;/strong&gt; — error on undefined variables. This is the one that saves you from &lt;code&gt;rm -rf $UNDEFINED/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set -o pipefail&lt;/code&gt;&lt;/strong&gt; — propagate failures through pipes. Without it, &lt;code&gt;failing-command | grep something&lt;/code&gt; succeeds because grep succeeds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IFS=$'\n\t'&lt;/code&gt;&lt;/strong&gt; — sane field splitting. Defends against word-splitting bugs in filenames.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the AI-generated script doesn't have these, add them and re-read the script. You'll often discover bugs the pragma now flags.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Is every variable expansion quoted?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Wrong&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="nv"&gt;$TARGET_DIR&lt;/span&gt;

&lt;span class="c"&gt;# Right&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TARGET_DIR&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



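&lt;p&gt;The wrong version is what causes the "I deleted the root directory" stories. If &lt;code&gt;$TARGET_DIR&lt;/code&gt; contains a space, the command becomes &lt;code&gt;rm -rf foo bar&lt;/code&gt; (delete two unintended things). If it is empty or unset, a bare &lt;code&gt;rm -rf&lt;/code&gt; does nothing, but the common variant &lt;code&gt;rm -rf $TARGET_DIR/*&lt;/code&gt; becomes &lt;code&gt;rm -rf /*&lt;/code&gt;. Quoting does not fix that case; &lt;code&gt;"${TARGET_DIR:?}"&lt;/code&gt; does, because it aborts when the variable is empty or unset.&lt;/p&gt;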

&lt;p&gt;Models default to the wrong version about half the time because the right version is harder to write in chat ("escape the quotes!") and the wrong version is what most blogs show.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; When reading AI bash, mentally check every &lt;code&gt;$VAR&lt;/code&gt; for quotes. Add them if missing. This is the single biggest source of bash disasters.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What happens if a step fails partway through?
&lt;/h3&gt;

&lt;p&gt;The AI will cheerfully write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/new-app
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/new-app
&lt;span class="nb"&gt;tar &lt;/span&gt;xzf &lt;span class="nv"&gt;$TARBALL&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nv"&gt;$TARBALL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What happens if &lt;code&gt;tar xzf&lt;/code&gt; fails (corrupt tarball, full disk)? With &lt;code&gt;set -e&lt;/code&gt;, the script stops. Good. Without &lt;code&gt;set -e&lt;/code&gt;, it continues to &lt;code&gt;rm $TARBALL&lt;/code&gt; and deletes your tarball with no backup.&lt;/p&gt;

&lt;p&gt;For any state-changing script, ask yourself: at each step, what's the recovery path if the step fails? If the answer is "nothing automated," the script should at least &lt;em&gt;not delete data&lt;/em&gt; before verifying the previous step succeeded.&lt;/p&gt;

&lt;p&gt;The AI almost never thinks about this on its own.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Are secrets visible in logs?
&lt;/h3&gt;

&lt;p&gt;The most common way AI-generated bash leaks secrets is via &lt;code&gt;set -x&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# debugging&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://api.example.com/...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;set -x&lt;/code&gt;, every command is printed including the expanded variables. Your API token is now in the script's output, which is in your CI logs, which are visible to anyone with project access.&lt;/p&gt;

&lt;p&gt;The fix is selective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;set&lt;/span&gt; +x  &lt;span class="c"&gt;# disable trace&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$API_TOKEN&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; https://api.example.com/...
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-x&lt;/span&gt;  &lt;span class="c"&gt;# re-enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or simply remove &lt;code&gt;set -x&lt;/code&gt; once debugging is done. The model frequently leaves it in.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Does it run as root unnecessarily?
&lt;/h3&gt;

&lt;p&gt;The AI will sometimes write &lt;code&gt;sudo&lt;/code&gt; into every command, even ones that don't need it. Or it'll assume the script runs as root and use absolute paths that require root to write.&lt;/p&gt;

&lt;p&gt;The principle: if a command can run as a non-root user, it should. The smaller the privileged surface, the smaller the blast radius.&lt;/p&gt;

&lt;p&gt;This is especially important for scripts that download and execute code. A common pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dangerous: privileged download + execute&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'curl https://example.com/install.sh | bash'&lt;/span&gt;

&lt;span class="c"&gt;# Safer: review then run&lt;/span&gt;
curl https://example.com/install.sh &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; install.sh
&lt;span class="c"&gt;# READ install.sh&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;bash install.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the model generates the first pattern, replace it with the second. Always.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Last month I asked Claude to write a script that cleans up Docker images older than 30 days on a CI runner host. The first draft was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;DOCKER_IMAGES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;docker images &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.ID}} {{.CreatedAt}}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;CUTOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DOCKER_IMAGES&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;read &lt;/span&gt;ID DATE&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;$CREATED&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="nv"&gt;$CUTOFF&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;docker rmi &lt;span class="nv"&gt;$ID&lt;/span&gt;
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walking the checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No strict pragma.&lt;/strong&gt; Missing &lt;code&gt;set -euo pipefail&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unquoted &lt;code&gt;$CREATED&lt;/code&gt;, &lt;code&gt;$CUTOFF&lt;/code&gt;, &lt;code&gt;$ID&lt;/code&gt;.&lt;/strong&gt; Each one is a potential bug. The &lt;code&gt;read&lt;/code&gt; is also missing &lt;code&gt;-r&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure handling.&lt;/strong&gt; &lt;code&gt;docker rmi&lt;/code&gt; fails if an image is in use. The script continues, marches through, and silently fails on every in-use image. We never know which were cleaned and which weren't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No secrets&lt;/strong&gt; (docker doesn't expose them here), but the script also doesn't log what it's doing, so you can't audit afterward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;sudo&lt;/code&gt;,&lt;/strong&gt; good — assumes the user has Docker socket access, which is reasonable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The hardened version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/usr/bin/env bash&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail
&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;$'&lt;/span&gt;&lt;span class="se"&gt;\n\t&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;

&lt;span class="nv"&gt;CUTOFF&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'30 days ago'&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;REMOVED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="nv"&gt;SKIPPED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0

&lt;span class="c"&gt;# Feed the loop via process substitution so it runs in the current shell.&lt;/span&gt;
&lt;span class="c"&gt;# Piping into `while` would run it in a subshell and lose the counters.&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'|'&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; ID DATE&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;CREATED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CREATED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-lt&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$CUTOFF&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        if &lt;/span&gt;docker rmi &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Removed: &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nv"&gt;REMOVED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;REMOVED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;else
            &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Skipped (in use): &lt;/span&gt;&lt;span class="nv"&gt;$ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="nv"&gt;SKIPPED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;SKIPPED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;fi
    fi
done&lt;/span&gt; &amp;lt; &amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;docker images &lt;span class="nt"&gt;--format&lt;/span&gt; &lt;span class="s1"&gt;'{{.ID}}|{{.CreatedAt}}'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Cleanup complete. Removed: &lt;/span&gt;&lt;span class="nv"&gt;$REMOVED&lt;/span&gt;&lt;span class="s2"&gt;, Skipped: &lt;/span&gt;&lt;span class="nv"&gt;$SKIPPED&lt;/span&gt;&lt;span class="s2"&gt;."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This took two minutes of editing. Without the checklist, I might have run the original and noticed days later that disk usage hadn't really dropped because half the images were in use.&lt;/p&gt;

&lt;h2&gt;
  
  
  A small note on bash linting
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;shellcheck&lt;/code&gt; catches most of these issues automatically. If you adopt one tool from this article, make it shellcheck:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;shellcheck cleanup-images.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



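&lt;p&gt;It will flag unquoted variables, unchecked &lt;code&gt;cd&lt;/code&gt; calls, variables modified inside a pipeline subshell, and a dozen other patterns. It does not enforce strict mode, so add &lt;code&gt;set -euo pipefail&lt;/code&gt; yourself. AI-generated bash usually has at least one shellcheck warning.&lt;/p&gt;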

&lt;p&gt;I now run shellcheck on every script before I run the script itself. It's two seconds and catches things I'd miss.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the AI gets it right
&lt;/h2&gt;

&lt;p&gt;To be fair: the model is often perfectly capable of producing safe bash. If you prompt it explicitly — "write this with &lt;code&gt;set -euo pipefail&lt;/code&gt;, quote every variable, fail loudly on errors" — you'll get a clean script.&lt;/p&gt;

&lt;p&gt;The problem is that "write me a script that does X" without that prompt gets you the &lt;em&gt;common&lt;/em&gt; form of the script, which is the unsafe form. So the rule of thumb:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always include the safety requirements in the prompt.&lt;/strong&gt; Or: always treat the output as a draft that needs hardening. Don't run any bash the AI wrote without one of those two disciplines.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bottom line
&lt;/h2&gt;

&lt;p&gt;Bash from AI is fast to produce and easy to read incorrectly. The checklist is short (strict mode, quoted expansions, failure paths, secrets in logs, unnecessary privilege), and applying it takes a couple of minutes per script. Skipping it risks anything from a "minor cleanup mistake" to a "career incident." There's no excuse not to do the check.&lt;/p&gt;

&lt;p&gt;For our bash-specific prompts, see &lt;a href="https://dev.to/prompts/bash-script-code-review/"&gt;bash-script-code-review&lt;/a&gt; and the related &lt;a href="https://dev.to/prompts/linux-server-hardening/"&gt;linux-server-hardening&lt;/a&gt; prompt.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/securing-ai-generated-bash-scripts/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>bash</category>
      <category>security</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Best AI Tools for DevOps Engineers in 2026</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Wed, 17 Jun 2026 20:59:44 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/the-best-ai-tools-for-devops-engineers-in-2026-15a9</link>
      <guid>https://dev.to/devopsaitoolkit/the-best-ai-tools-for-devops-engineers-in-2026-15a9</guid>
      <description>&lt;p&gt;If you spend your day in a terminal, a YAML editor, or a Grafana tab, AI assistants in 2026 are no longer a curiosity. They're a real productivity layer. But not every tool is good at infrastructure work. After a year of daily use across Linux administration, OpenStack operations, Prometheus alert authoring, and Kubernetes debugging, here's the honest shortlist.&lt;/p&gt;

&lt;h2&gt;
  
  
  The criteria
&lt;/h2&gt;

&lt;p&gt;We're not ranking on benchmark scores. We're ranking on &lt;strong&gt;infrastructure usefulness&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning over command output&lt;/strong&gt; — can it actually read &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;kubectl describe&lt;/code&gt;, or &lt;code&gt;journalctl&lt;/code&gt; and find the real problem?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; — does it warn before suggesting destructive commands?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long context&lt;/strong&gt; — can it hold a 1,000-line &lt;code&gt;.gitlab-ci.yml&lt;/code&gt; plus failing logs without losing track?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminal integration&lt;/strong&gt; — can you use it without leaving your workflow?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy and self-host options&lt;/strong&gt; — for the engineers whose employers care.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The shortlist
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Claude (Anthropic)
&lt;/h3&gt;

&lt;p&gt;The current best general assistant for infrastructure reasoning. Long context handles enormous log dumps and Kubernetes manifests in one shot. It is consistently more cautious about destructive commands than alternatives — which matters when you're tired at 2am and tempted to copy-paste straight into prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Linux/OpenStack/Kubernetes troubleshooting, postmortem drafting, code review on infrastructure-as-code.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. ChatGPT (OpenAI)
&lt;/h3&gt;

&lt;p&gt;The broadest ecosystem. Strong code generation, plug-in support, and the largest community of shared prompts and patterns. For Ansible and Terraform generation, output quality is excellent. Slightly less cautious by default — you'll want to add safety constraints in your prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Ansible/Terraform generation, ad-hoc scripting, learning new tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Cursor
&lt;/h3&gt;

&lt;p&gt;If you live in an IDE, Cursor is what your IDE should have been. Native multi-file context, agent mode for repo-wide refactors, and tab-completion that actually understands your codebase. Especially strong for IaC repositories with many interconnected files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Editing real codebases (Helm charts, Terraform modules, Python operators).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. GitHub Copilot
&lt;/h3&gt;

&lt;p&gt;The lowest-friction option. Inline completion just works, and the chat sidebar is genuinely useful for "explain this regex" or "what's this PromQL doing?" If your org already runs on GitHub, Copilot is the easiest add-on to justify, though seats are billed separately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Inline completion while editing YAML, Bash, Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Warp Terminal (with AI features)
&lt;/h3&gt;

&lt;p&gt;The only entry on this list that isn't an AI assistant per se — it's a terminal that has AI built in. The killer feature: natural-language command suggestions in your shell, with safety previews. For Linux admins who don't want to alt-tab to a chat window every five seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Terminal-native workflows where context-switching kills focus.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we don't recommend (yet)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic LLM wrappers that promise "DevOps AI."&lt;/strong&gt; Most are thin layers over the same APIs above, sometimes with worse safety defaults. Use the underlying tools directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anything that requires uploading your &lt;code&gt;~/.ssh&lt;/code&gt; directory or production credentials.&lt;/strong&gt; Be skeptical of "AI agents that run commands for you" without a clear sandbox model.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to combine them
&lt;/h2&gt;

&lt;p&gt;A pattern that works well in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claude or ChatGPT in a browser&lt;/strong&gt; for deep diagnosis sessions (paste logs, walk through hypotheses, draft postmortems).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor or Copilot in your editor&lt;/strong&gt; for actually writing the fix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warp&lt;/strong&gt; in the terminal for quick command lookups without switching context.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need one perfect tool. You need a workflow where each tool plays to its strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/claude-linux-troubleshooting/"&gt;How to Use Claude to Troubleshoot Linux Servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/chatgpt-vs-claude-for-infrastructure/"&gt;ChatGPT vs Claude for Infrastructure Engineers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/best-ai-tools-for-devops-engineers/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>tools</category>
      <category>claude</category>
    </item>
    <item>
      <title>Auditing Kubernetes Manifests With AI: A Practical Workflow</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Tue, 16 Jun 2026 04:31:15 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/auditing-kubernetes-manifests-with-ai-a-practical-workflow-4368</link>
      <guid>https://dev.to/devopsaitoolkit/auditing-kubernetes-manifests-with-ai-a-practical-workflow-4368</guid>
      <description>&lt;p&gt;A senior K8s engineer I work with audits manifests faster than I read them. He's seen so many patterns that "missing readinessProbe on a Deployment that takes 45 seconds to start" jumps off the page. Most of us don't have that pattern library memorized — and increasingly, we don't need to. AI assistants have read more Kubernetes manifests than any human ever will.&lt;/p&gt;

&lt;p&gt;The catch: a generic "review this YAML" prompt produces generic noise. You need to direct the model toward the categories of issues that actually matter in your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two mistakes everyone makes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Asking for "a security review."&lt;/strong&gt; You'll get a bullet list of every possible concern, ranked alphabetically, with no signal about which matter. You'll skim, dismiss, and learn nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Pasting one manifest.&lt;/strong&gt; Real Kubernetes problems live in the interaction between resources — a Deployment's readiness probe and a Service's selector, a NetworkPolicy and the actual app traffic. One YAML in isolation hides most of the bugs.&lt;/p&gt;

&lt;p&gt;The fix for both is the same: give the model a &lt;em&gt;bounded scope&lt;/em&gt; and &lt;em&gt;enough context&lt;/em&gt; to reason about interactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  A workflow that works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Pick the audit dimension
&lt;/h3&gt;

&lt;p&gt;Pre-decide what you're checking &lt;em&gt;for&lt;/em&gt;. Different prompts for different dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits &amp;amp; QoS&lt;/strong&gt; — are requests/limits set, does QoS match intent, are limits realistic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probes &amp;amp; lifecycle&lt;/strong&gt; — readiness, liveness, startup, preStop, terminationGracePeriodSeconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security context&lt;/strong&gt; — runAsNonRoot, capabilities, readOnlyRootFilesystem, seccomp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network exposure&lt;/strong&gt; — NetworkPolicy, Service type, Ingress rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability&lt;/strong&gt; — PodDisruptionBudget, topology spread, replica count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State &amp;amp; storage&lt;/strong&gt; — PVC access modes, retention policies, backup tags&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mixing dimensions in one review produces wishy-washy output. Pick one, get a clean answer, move on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Paste the manifest + related context
&lt;/h3&gt;

&lt;p&gt;For a workload review, paste:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Deployment / StatefulSet / DaemonSet&lt;/li&gt;
&lt;li&gt;Its Service(s) and Ingress&lt;/li&gt;
&lt;li&gt;Any NetworkPolicies that match its labels&lt;/li&gt;
&lt;li&gt;The HPA if relevant&lt;/li&gt;
&lt;li&gt;The ConfigMaps and Secrets it references (sanitize first)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together this is usually under 500 lines of YAML, well within any current model's context window. The model can now reason about interactions, not just isolated fields.&lt;/p&gt;
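
&lt;p&gt;As a sketch of why the related resources matter, here is a hypothetical Service for a workload labeled &lt;code&gt;app: payments&lt;/code&gt;. The names, labels, and ports are illustrative assumptions, not taken from a real cluster. Neither commented field can be checked without the Deployment next to it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical Service; every name and port here is an assumption.
apiVersion: v1
kind: Service
metadata:
  name: payments
spec:
  type: ClusterIP
  selector:
    app: payments        # must equal the pod template labels, or the Service has no endpoints
  ports:
  - name: http
    port: 80
    targetPort: 8080     # must be a port the container actually listens on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A one-character typo in the selector leaves the Service valid but with no endpoints, which is the kind of finding a single-manifest review cannot produce.&lt;/p&gt;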

&lt;h3&gt;
  
  
  Step 3: Use a directive prompt
&lt;/h3&gt;

&lt;p&gt;The big difference between "tell me about this YAML" and a useful review is &lt;em&gt;the instruction format&lt;/em&gt;. Compare:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Review this Kubernetes manifest.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;versus:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are reviewing a production Deployment + Service + NetworkPolicy bundle. For each finding, give: (1) severity (critical/high/medium/low), (2) the exact field path that's wrong, (3) one sentence on why it matters, (4) the corrected YAML snippet. Focus only on probes, lifecycle, and graceful shutdown. Ignore documentation/comments.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first prompt produces an essay. The second produces a list of fixable issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Verify before applying
&lt;/h3&gt;

&lt;p&gt;This is where most reviews go wrong. The model is right &lt;em&gt;most of the time&lt;/em&gt;. It's wrong some of the time, often in ways that look correct.&lt;/p&gt;

&lt;p&gt;Common AI failure modes in K8s review:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated field names&lt;/strong&gt; — &lt;code&gt;spec.template.spec.terminationGracePeriod&lt;/code&gt; (it's &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outdated API versions&lt;/strong&gt; — &lt;code&gt;policy/v1beta1 PodDisruptionBudget&lt;/code&gt; (removed in 1.25)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong defaults claimed&lt;/strong&gt; — claiming &lt;code&gt;failureThreshold&lt;/code&gt; defaults to 1 when it's 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Misreading the use case&lt;/strong&gt; — recommending &lt;code&gt;runAsNonRoot: true&lt;/code&gt; for a workload that legitimately needs root&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every "fix" the model suggests, glance at the official K8s docs for that field. This adds 30 seconds per finding and catches the wrong ones. Without this step, you will apply changes that break things.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Here's a Deployment I reviewed last week:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;payments&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;payments&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.example.com/payments:v3.1.0&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://payments-db:5432/payments&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;/healthz&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;8080&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I asked Claude to review for probes and graceful shutdown only. The findings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
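&lt;strong&gt;No explicit &lt;code&gt;requests&lt;/code&gt;, only &lt;code&gt;limits&lt;/code&gt;&lt;/strong&gt; → Kubernetes copies the limits into the requests, so each replica reserves a full 2 CPU / 2Gi at scheduling time and lands in &lt;code&gt;Guaranteed&lt;/code&gt; QoS whether or not that was intended. Set requests explicitly, equal to or below limits.&lt;/li&gt;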
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;initialDelaySeconds: 5&lt;/code&gt;&lt;/strong&gt; → Java/Spring apps typically need 30-90 seconds to start. Add &lt;code&gt;startupProbe&lt;/code&gt; with longer threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;livenessProbe&lt;/code&gt;&lt;/strong&gt; → kubelet won't restart if the app deadlocks. Mirror readinessProbe with looser thresholds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;&lt;/strong&gt; → defaults to 30s; for a payment service with in-flight requests, this is borderline. Set to 60s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;preStop&lt;/code&gt; hook&lt;/strong&gt; → SIGTERM hits immediately; load balancers may still send traffic for ~10s after pod marked Terminating. Add &lt;code&gt;sleep 15&lt;/code&gt; preStop.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All five were real, all five were fixable in two minutes of YAML editing. The model didn't tell me about anything irrelevant. That's because I scoped the prompt to "probes and graceful shutdown only."&lt;/p&gt;

&lt;p&gt;The big one — #5 — is something I've personally been bitten by twice. The model wouldn't have prioritized it without the directive prompt.&lt;/p&gt;
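
&lt;p&gt;Applied to the Deployment above, the five fixes might look like the fragment below. This is a sketch, not the manifest that was actually deployed: the request sizes, probe periods, and thresholds are assumed values to tune per service, and the preStop hook assumes the image ships a &lt;code&gt;sleep&lt;/code&gt; binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Fragment of spec.template.spec; env is unchanged and omitted here.
terminationGracePeriodSeconds: 60      # finding 4
containers:
- name: app
  image: registry.example.com/payments:v3.1.0
  ports:
  - containerPort: 8080
  resources:
    requests:                          # finding 1 (assumed sizes)
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "2Gi"
  startupProbe:                        # finding 2: allows up to 30 x 5s = 150s to start
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 5
    failureThreshold: 30
  readinessProbe:
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 5
  livenessProbe:                       # finding 3: same endpoint, looser thresholds
    httpGet: { path: /healthz, port: 8080 }
    periodSeconds: 10
    failureThreshold: 6
  lifecycle:
    preStop:                           # finding 5: keep serving while endpoints drain
      exec:
        command: ["sleep", "15"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The preStop sleep counts against &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;, so with these values the app has roughly 45 seconds after SIGTERM to finish in-flight requests.&lt;/p&gt;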

&lt;h2&gt;
  
  
  What about Kyverno / OPA / Pod Security Admission?
&lt;/h2&gt;

&lt;p&gt;Yes, you should run those too. They catch consistent issues at admission time. They don't catch issues that require &lt;em&gt;judgment&lt;/em&gt;: "is 30 seconds enough graceful shutdown for this specific service?" Policy enforcement is a floor; AI review is a directed second opinion above that floor.&lt;/p&gt;

&lt;p&gt;I run both. Kyverno catches "no securityContext at all" before it ever lands. AI review catches "readinessProbe path doesn't match what the app exposes" — something only a human (or an AI imitating one) would notice.&lt;/p&gt;
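
&lt;p&gt;As a minimal sketch of that floor, Pod Security Admission is switched on per namespace with labels. The namespace name and the chosen levels below are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: payments                                    # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline    # reject pods that violate the baseline profile
    pod-security.kubernetes.io/warn: restricted     # warn on anything short of restricted
    pod-security.kubernetes.io/audit: restricted    # record the same violations in the audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Labels like these reject or flag a pod mechanically. Whether 60 seconds of grace is enough for a payment service is still a question for the directed review.&lt;/p&gt;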

&lt;h2&gt;
  
  
  A starter prompt
&lt;/h2&gt;

&lt;p&gt;If you want a template, here's the one I use most:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are reviewing a Kubernetes workload bundle for production readiness. Focus only on: probes (readiness, liveness, startup), &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt;, preStop hooks, and rolling update strategy. For each finding produce: severity, exact field path, why it matters in one sentence, corrected YAML. Ignore everything else (security context, network policies, resource limits — those are separate reviews). The workload is [serves HTTP at /api on port 8080 / consumes from a queue / batch processor that runs N hours].&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The bracketed context at the end is what makes the review accurate for &lt;em&gt;your&lt;/em&gt; workload. Without it, the model assumes a generic web service.&lt;/p&gt;

&lt;p&gt;For our full prompt library on Kubernetes review, see the &lt;a href="https://dev.to/categories/kubernetes-helm/"&gt;Kubernetes &amp;amp; Helm category&lt;/a&gt; — especially &lt;a href="https://dev.to/prompts/kubernetes-yaml-security-review/"&gt;kubernetes-yaml-security-review&lt;/a&gt; and &lt;a href="https://dev.to/prompts/kubernetes-resource-limits-tuning/"&gt;kubernetes-resource-limits-tuning&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/auditing-kubernetes-manifests-with-ai/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>yaml</category>
      <category>security</category>
    </item>
    <item>
      <title>How to Use Claude to Troubleshoot Linux Servers</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sun, 14 Jun 2026 21:15:59 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/how-to-use-claude-to-troubleshoot-linux-servers-1fhe</link>
      <guid>https://dev.to/devopsaitoolkit/how-to-use-claude-to-troubleshoot-linux-servers-1fhe</guid>
      <description>&lt;p&gt;Claude is genuinely useful for production Linux troubleshooting — when you use it right. Here's the workflow that works, after a year of using it on real incidents across Ubuntu, RHEL, and Rocky.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model: Claude is a senior pair, not an oracle
&lt;/h2&gt;

&lt;p&gt;The mistake most engineers make on day one: they paste a 5-line error message and expect a fix. Claude can do better than that — but only if you give it the same context you'd give a senior engineer joining your incident bridge.&lt;/p&gt;

&lt;p&gt;A senior engineer would want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What OS and version?&lt;/li&gt;
&lt;li&gt;What does this server do?&lt;/li&gt;
&lt;li&gt;What changed recently?&lt;/li&gt;
&lt;li&gt;What's the actual symptom?&lt;/li&gt;
&lt;li&gt;What command output have you already gathered?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Give Claude that, and the quality of analysis changes completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Establish context with a system prompt
&lt;/h3&gt;

&lt;p&gt;Use our &lt;a href="https://dev.to/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt; as your system prompt, or paraphrase: &lt;em&gt;"You are a senior Linux sysadmin. Rank root-cause hypotheses by probability. Recommend safe diagnostics first. Label destructive commands as DANGEROUS."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Paste structured context, not noise
&lt;/h3&gt;

&lt;p&gt;Good:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;OS: Ubuntu 22.04, kernel 5.15
Role: production MySQL replica, 64GB RAM, 16 cores
Recent changes: kernel upgrade 6 hours ago
Symptom: server load average 40+, MySQL replication lag growing, queries timing out

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;uptime&lt;/span&gt;
&lt;span class="go"&gt; 14:22:01 up 6:02,  4 users,  load average: 41.23, 38.51, 35.04

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="go"&gt;              total        used        free      shared  buff/cache   available
Mem:           62Gi        58Gi       1.2Gi       128Mi       3.1Gi       1.8Gi

&lt;/span&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;iostat &lt;span class="nt"&gt;-xz&lt;/span&gt; 2 3
&lt;span class="go"&gt;[...]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my server is slow can you help
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Let it ask follow-up questions
&lt;/h3&gt;

&lt;p&gt;The good prompts in our library tell Claude to &lt;strong&gt;ask for missing data&lt;/strong&gt; before guessing. When it asks "can you share &lt;code&gt;dmesg | tail -50&lt;/code&gt; and &lt;code&gt;vmstat 1 5&lt;/code&gt;?" — that's a feature, not a flaw. Give it the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Validate suggested commands before running
&lt;/h3&gt;

&lt;p&gt;Claude will sometimes suggest a command with subtly wrong syntax, a destructive flag, or a path that doesn't exist on your distro. Read every suggestion before running. &lt;strong&gt;Never paste straight into a root shell.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Keep the conversation alive
&lt;/h3&gt;

&lt;p&gt;Claude's long context means you can run a 30-minute diagnostic session in one thread, paste new output as you gather it, and the model retains the full diagnostic context. This is the single biggest workflow win versus older AI tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Claude is good at
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Reading command output you don't fully understand (&lt;code&gt;strace&lt;/code&gt;, &lt;code&gt;perf&lt;/code&gt;, &lt;code&gt;tcpdump&lt;/code&gt; summaries).&lt;/li&gt;
&lt;li&gt;Drafting &lt;code&gt;awk&lt;/code&gt;/&lt;code&gt;sed&lt;/code&gt;/&lt;code&gt;grep&lt;/code&gt; one-liners for log analysis.&lt;/li&gt;
&lt;li&gt;Explaining why a specific kernel parameter or sysctl is set.&lt;/li&gt;
&lt;li&gt;Suggesting &lt;em&gt;what to look at next&lt;/em&gt; when you're stuck.&lt;/li&gt;
&lt;li&gt;Drafting the incident summary after you've fixed it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Claude is not good at
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time anything — it can't see your live metrics.&lt;/li&gt;
&lt;li&gt;Distinguishing between two plausible root causes when both fit the symptoms (it'll guess).&lt;/li&gt;
&lt;li&gt;Telling you what's normal for &lt;em&gt;your&lt;/em&gt; environment. You have to provide that baseline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A real-world example
&lt;/h2&gt;

&lt;p&gt;A production server's load average suddenly spiked. I pasted &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;iostat -xz 2 3&lt;/code&gt;, and &lt;code&gt;dmesg | tail -50&lt;/code&gt; into Claude with our prompt template, and it immediately flagged: &lt;em&gt;"&lt;code&gt;%iowait&lt;/code&gt; is 78%, &lt;code&gt;await&lt;/code&gt; on /dev/sda is 320ms, and dmesg shows 'task X blocked for more than 120 seconds.' The disk subsystem is saturated, not CPU. Investigate which process is doing heavy I/O: &lt;code&gt;iotop -oP -d1&lt;/code&gt; will show the writer in 1-second intervals."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's exactly the diagnosis we wanted, framed with the evidence — in seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Companion resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/prompts/bash-script-code-review/"&gt;Bash Script Code Review Prompt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/ai-safely-with-bash/"&gt;How to Use AI Safely with Bash Commands&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/claude-linux-troubleshooting/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>claude</category>
      <category>linux</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>How to Choose the Right DevOps as a Service Provider</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Sat, 13 Jun 2026 19:26:40 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/how-to-choose-the-right-devops-as-a-service-provider-466l</link>
      <guid>https://dev.to/devopsaitoolkit/how-to-choose-the-right-devops-as-a-service-provider-466l</guid>
      <description>&lt;p&gt;I've spent 25 years building, breaking, and scaling production infrastructure — long enough to watch "DevOps" go from a conference buzzword to a thing companies now rent by the month. That shift is real, and for a lot of teams it's the right call. But the gap between a great DevOps as a Service provider and a bad one is enormous, and the marketing pages all read the same.&lt;/p&gt;

&lt;p&gt;So this is the article I wish more buyers had: what DevOps as a Service actually means, when it beats hiring, and how to tell — before you sign — whether the people you're talking to have ever been on-call at 3am.&lt;/p&gt;

&lt;h2&gt;
  
  
  What DevOps as a Service actually means
&lt;/h2&gt;

&lt;p&gt;DevOps as a Service (DaaS) is outsourcing the engineering function that builds and runs your delivery pipeline and infrastructure, rather than hiring that function in-house. A provider takes ownership of some or all of: your CI/CD, your cloud environments, your observability, your automation, and the on-call response when something breaks.&lt;/p&gt;

&lt;p&gt;It is not a single tool, and it is not a one-time project. A consultancy that drops a Terraform repo and disappears is not DaaS. The "as a Service" part means there's an ongoing operational relationship — someone is responsible for your systems on Tuesday at 2am, not just during the engagement.&lt;/p&gt;

&lt;p&gt;Done well, you get the output of a seasoned platform team — Linux fundamentals, Kubernetes, Docker, infrastructure-as-code, pipelines, monitoring — without carrying that whole team on payroll.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why companies outsource DevOps instead of hiring
&lt;/h2&gt;

&lt;p&gt;Hiring a full in-house DevOps team is the "right" answer that's often the wrong answer in practice. Here's why teams rent it instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; A single senior DevOps/SRE hire in a competitive market is expensive — and you need more than one for real on-call coverage. Add recruiting time, ramp-up, benefits, and the risk of a bad hire, and the fully-loaded number gets large fast. A provider amortizes senior talent across clients, so you pay for the expertise without paying for the bench.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speed to maturity.&lt;/strong&gt; A good provider has already built the Terraform modules, the GitLab CI templates, the Prometheus alert libraries, the backup runbooks. You're buying an opinionated, battle-tested baseline instead of inventing it. That can compress a year of platform work into weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-call coverage.&lt;/strong&gt; Sustainable 24/7 on-call needs roughly six to eight engineers in a healthy rotation. Most companies under a certain size simply cannot staff that without burning people out. Providers spread the rotation across a larger team, so nobody's carrying a pager every single night.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hard-to-hire seniority.&lt;/strong&gt; The engineers who can debug a gnarly Kubernetes networking issue, reason about etcd, and also write clean Terraform are rare and they know it. They're hard to attract and harder to retain at a non-tech company. DaaS is often the only realistic way for a mid-sized business to get that caliber of person near its infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's usually included
&lt;/h2&gt;

&lt;p&gt;Scope varies, but a full-spectrum provider should be able to own all of these. When you evaluate one, map their offering against this list and find out exactly where the lines are.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD&lt;/strong&gt; — pipeline design, build/test/deploy stages, and crucially, a real rollback path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud infrastructure&lt;/strong&gt; — provisioning and managing your environments as code (Terraform or equivalent), with sane network and IAM design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and observability&lt;/strong&gt; — Prometheus, Grafana, logs, and alert rules that page a human only when a human is actually needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt; — configuration management with Ansible, scripted runbooks, and elimination of manual toil.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; — secrets management, least-privilege access, patching, and image scanning baked into the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident response&lt;/strong&gt; — a defined process, on-call rotation, and blameless postmortems, not just "we'll look at it."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backups and disaster recovery&lt;/strong&gt; — and, more importantly, tested restores. A backup you've never restored is a rumor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; — right-sizing, autoscaling, spot/reserved strategy, and killing the zombie resources nobody owns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Questions to ask before you hire a provider
&lt;/h2&gt;

&lt;p&gt;This is the part that separates the real operators from the slide decks. Don't ask "do you do Kubernetes?" — everyone says yes. Ask for specifics and watch how fast and how concretely they answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;"Show me your Terraform module structure and how you handle state."&lt;/strong&gt; Real teams have an opinion about remote state, locking, workspace-vs-directory layout, and blast-radius isolation. Vague answers here mean they're winging your infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Walk me through a real GitLab CI pipeline you run, including the rollback path."&lt;/strong&gt; A deploy story with no rollback story is half a pipeline. I want to hear how they revert a bad release in minutes, not hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How do you wire Prometheus alert rules to avoid pager fatigue?"&lt;/strong&gt; The right answer involves symptom-based alerting, &lt;code&gt;for:&lt;/code&gt; durations, severity routing, and ruthless deletion of noisy alerts. If every blip pages everyone, nobody responds to the one that matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What does your on-call rotation look like, and what's your real response time?"&lt;/strong&gt; Get the rotation size, escalation policy, and the SLA in writing. "We're very responsive" is not an SLA.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How do you manage secrets and access?"&lt;/strong&gt; Listen for a vault, short-lived credentials, and least privilege — not secrets in environment files or a shared password manager.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"When did you last test a restore from backup, and how long did it take?"&lt;/strong&gt; The hesitation tells you everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"How do you handle configuration drift?"&lt;/strong&gt; Ansible, immutable images, drift detection — there should be a system, not heroics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"What happens to our infrastructure if we leave you?"&lt;/strong&gt; A confident provider hands you clean, documented IaC and walks away gracefully. Lock-in is a choice they make, and you should know it up front.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;"Who specifically will be on our account, and what's their production background?"&lt;/strong&gt; You're buying judgment. Find out whose judgment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Red flags to avoid
&lt;/h2&gt;

&lt;p&gt;A few patterns that, in my experience, reliably predict pain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Buzzword density with no specifics.&lt;/strong&gt; If they can't move from "we leverage cloud-native synergies" to "here's how we structure a Helm chart" in one question, walk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No rollback story.&lt;/strong&gt; Anyone can deploy. Operators can un-deploy under pressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickOps in the cloud console.&lt;/strong&gt; If they're configuring your production environment by hand instead of in code, you have no reproducibility and no audit trail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Everything is "automated by AI."&lt;/strong&gt; AI helps. AI does not own your incident at 2am. A provider hiding thin staffing behind AI claims is a serious risk (more on this below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert noise as a feature.&lt;/strong&gt; Hundreds of alerts is not observability; it's a team that's trained itself to ignore the dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No postmortems, or blame-heavy ones.&lt;/strong&gt; A team that doesn't write honest postmortems isn't learning, and you'll pay for the same outage twice.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;They won't show you anything real.&lt;/strong&gt; Sanitized examples are fine. "We can't show you any of our work" usually means there isn't much to show.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep lock-in by design.&lt;/strong&gt; Proprietary wrappers around standard tools, undocumented infra, contracts that punish leaving — all signs they're protecting revenue, not your uptime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why real production experience beats buzzwords
&lt;/h2&gt;

&lt;p&gt;Here's the thing the marketing won't tell you: tools are easy, judgment is hard. Anyone can &lt;code&gt;terraform apply&lt;/code&gt;. The value is in the engineer who knows &lt;em&gt;not&lt;/em&gt; to apply at 4:55pm on a Friday, who recognizes the failure mode three layers down, who's restored a database under pressure and remembers exactly how it went wrong last time.&lt;/p&gt;

&lt;p&gt;That judgment only comes from having run real production systems and felt the consequences. When you evaluate a provider, you're not really buying their Kubernetes skills — those are table stakes. You're buying scar tissue. You want the team that's debugged the keepalived VIP flap, the etcd disk-pressure cascade, the Docker layer that quietly doubled image size and blew out the build cache. Ask for war stories. The good ones light up; the pretenders get vague.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI fits — and where it doesn't
&lt;/h2&gt;

&lt;p&gt;I'm bullish on AI in DevOps, and I build with it daily. Used right, it's a genuine force multiplier: it can summarize a wall of logs faster than any human, draft Terraform and Ansible boilerplate, propose PromQL, correlate a timeline of "what changed," and write the first pass of a postmortem. That's real leverage, and a modern provider should be using it.&lt;/p&gt;

&lt;p&gt;But there's a hard line, and it's the same one I draw on my own systems: &lt;strong&gt;AI reads and reasons; humans run commands.&lt;/strong&gt; During an active incident, AI proposes a risk-classified, safest-first plan and a human executes every step. The model never touches production. If a provider tells you their AI auto-remediates your prod environment unattended, that's not maturity — that's an outage waiting for a confident-but-wrong suggestion.&lt;/p&gt;

&lt;p&gt;The right framing is AI as a very fast, very well-read junior engineer sitting next to a senior who owns the keyboard. It compresses the slow parts of the work without replacing the judgment that keeps your systems up. If you want to see what that looks like in practice, our &lt;a href="https://dev.to/dashboard/incident-response/"&gt;AI incident-response workflows&lt;/a&gt; and &lt;a href="https://dev.to/prompts/"&gt;prompt library&lt;/a&gt; are built around exactly that human-in-the-loop principle.&lt;/p&gt;

&lt;p&gt;So when you evaluate a provider's AI claims, ask the same question you'd ask about any tool: where's the human, and what's the blast radius if the AI is wrong?&lt;/p&gt;

&lt;h2&gt;
  
  
  How a good provider actually pays for itself
&lt;/h2&gt;

&lt;p&gt;The reason this model works isn't just cheaper labor — it's better outcomes in three places that show up directly on your books.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It saves money.&lt;/strong&gt; Cost optimization is continuous work most teams never get to: right-sizing nodes, tuning autoscaling, buying reserved capacity, deleting orphaned volumes and idle environments. A provider doing this routinely often saves more on cloud spend than they cost. The infrastructure-as-code discipline also prevents the expensive mistakes — the hand-clicked resource nobody can reproduce, the security group left wide open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It reduces downtime.&lt;/strong&gt; Better alerting means you catch degradation before customers do. Tested restores mean a disaster is an inconvenience, not a company-ending event. A defined incident process with real on-call coverage means the response starts in minutes. Downtime is one of the most expensive things a business buys without meaning to, and maturity here directly buys it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It speeds up deployments.&lt;/strong&gt; A solid GitLab CI pipeline with automated testing and a clean rollback path turns deploys from a scary quarterly event into a boring daily one. Teams that deploy confidently ship faster, and shipping faster is usually the whole point. The fastest way to slow down engineering is to make every release terrifying; good DevOps makes it dull.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go from here
&lt;/h2&gt;

&lt;p&gt;Be honest with yourself about where your infrastructure actually stands. Can you deploy and roll back in minutes, or does a release ruin someone's afternoon? Do your alerts mean something, or has your team learned to ignore them? If your primary database died right now, do you know — not hope, know — that you can restore it? Is there a real on-call rotation, or one exhausted person who's secretly the single point of failure?&lt;/p&gt;

&lt;p&gt;If those questions made you wince, you're not behind; you're normal. Most teams are operating with far less maturity than they think, and are trying to close that gap by hiring slowly, one expensive senior at a time, while production keeps moving. DevOps as a Service exists precisely so you don't have to win that hiring war before you can move fast.&lt;/p&gt;

&lt;p&gt;Take an honest inventory this week. Score yourself on pipelines, observability, incident response, and recovery. Wherever you find a gap that's quietly costing you money, downtime, or velocity, that's where a good provider earns their fee many times over. The teams that move fastest aren't the ones with the most engineers — they're the ones who got serious about maturity before the outage forced the conversation. Decide which kind you want to be, and move while it's still your choice.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Evaluate any provider against your own systems and constraints. The right answer depends on your scale, your risk tolerance, and how much production maturity you already have in-house.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/how-to-choose-devops-as-a-service-provider/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>manageddevops</category>
      <category>cicd</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>ChatGPT vs Claude for Infrastructure Engineers</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Thu, 11 Jun 2026 15:21:51 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/chatgpt-vs-claude-for-infrastructure-engineers-7j5</link>
      <guid>https://dev.to/devopsaitoolkit/chatgpt-vs-claude-for-infrastructure-engineers-7j5</guid>
      <description>&lt;p&gt;Both ChatGPT and Claude are excellent. But they have different strengths, and infrastructure engineers feel those differences more than most users — because we deal with long logs, multi-file configurations, and operations where being &lt;em&gt;almost right&lt;/em&gt; can mean being very wrong.&lt;/p&gt;

&lt;p&gt;Here's a side-by-side from a year of daily use on real infrastructure work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long-context reasoning over logs and manifests
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude's long context window means you can paste a 2,000-line &lt;code&gt;kubectl describe pod&lt;/code&gt;, the full Deployment manifest, and your last 50 events without losing fidelity. ChatGPT can handle long contexts too, but in practice it's more likely to summarize or "forget" earlier details mid-conversation.&lt;/p&gt;

&lt;p&gt;For diagnostic workflows where you keep pasting more output as you gather it, Claude's behavior is meaningfully better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Safety with destructive commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude (slightly).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without explicit prompting, Claude is more likely to flag destructive commands (&lt;code&gt;rm -rf&lt;/code&gt;, &lt;code&gt;DROP TABLE&lt;/code&gt;, &lt;code&gt;nova reset-state&lt;/code&gt;, &lt;code&gt;kubectl delete&lt;/code&gt;) with caveats. ChatGPT will too — but is more likely to just hand you the command without extra emphasis.&lt;/p&gt;

&lt;p&gt;If you use either tool in production troubleshooting, &lt;strong&gt;bake the safety constraints into your prompt&lt;/strong&gt; (our prompt library does this). Don't rely on default behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code generation: Ansible, Terraform, Bash, Python
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Roughly tied. Different defaults.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; tends toward more "modern" Terraform (newer providers, recent syntax) and is slightly faster to produce a working playbook from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; tends toward more cautious, conventional output with better comments and more attention to idempotency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For infrastructure-as-code review, Claude usually catches more subtle issues. For first-draft generation, ChatGPT is often a hair faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  PromQL and observability queries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Roughly tied.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Both can write correct PromQL with &lt;code&gt;rate()&lt;/code&gt;, &lt;code&gt;histogram_quantile()&lt;/code&gt;, and label aggregation. Both occasionally hallucinate metric names if you don't paste your &lt;code&gt;/metrics&lt;/code&gt; output. The deciding factor is your prompt quality, not the model.&lt;/p&gt;
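
&lt;p&gt;One way to hand either model your real metric names is to pull them out of the &lt;code&gt;/metrics&lt;/code&gt; output before you prompt. The following is a minimal Python sketch, assuming the standard Prometheus text exposition format; the &lt;code&gt;metric_names&lt;/code&gt; helper and the sample metrics are hypothetical, made up for illustration.&lt;/p&gt;

```python
def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format /metrics output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the HELP / TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return sorted(names)


# Hypothetical sample output, not taken from a real service.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_cpu_seconds_total 12.5
"""

known = metric_names(SAMPLE)
prompt_context = "Use only these metric names: " + ", ".join(known)
print(prompt_context)
```

&lt;p&gt;Prepending a line like &lt;code&gt;prompt_context&lt;/code&gt; to a PromQL request gives the model less room to invent names, whichever model you use.&lt;/p&gt;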

&lt;h2&gt;
  
  
  Postmortem drafting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: Claude.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude's prose is consistently more readable, less marketing-flavored, and more naturally blameless. ChatGPT tends to slip into corporate phrasing that engineers find grating ("leveraged our learnings to enhance reliability").&lt;/p&gt;

&lt;h2&gt;
  
  
  Ecosystem and integrations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Winner: ChatGPT.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT has a far larger ecosystem of plugins, GPTs, and shared prompts. If you want a tool that integrates with everything else you use, ChatGPT wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;p&gt;The two are roughly comparable in price for individual use, and both offer free tiers with rate limits. Team pricing varies with the organization's needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which should you use?
&lt;/h2&gt;

&lt;p&gt;The honest answer: &lt;strong&gt;both, for different tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt; for diagnostic sessions, postmortems, sensitive prod work, and IaC review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT&lt;/strong&gt; for fast scaffolding, plugin-heavy workflows, and broad community templates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can only pick one and you do mostly production troubleshooting, pick Claude. If you can only pick one and you do mostly greenfield IaC scaffolding, ChatGPT is fine — your prompt quality matters more than the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Companion resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/best-ai-tools-for-devops-engineers/"&gt;Best AI Tools for DevOps Engineers in 2026&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/blog/claude-linux-troubleshooting/"&gt;How to Use Claude to Troubleshoot Linux Servers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/prompts/linux-server-troubleshooting/"&gt;Linux Server Troubleshooting Prompt&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/chatgpt-vs-claude-for-infrastructure/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>chatgpt</category>
      <category>claude</category>
      <category>comparison</category>
    </item>
    <item>
      <title>How DevOps Engineers Can Use AI to Triage Production Incidents Faster</title>
      <dc:creator>James Joyner</dc:creator>
      <pubDate>Mon, 08 Jun 2026 19:49:29 +0000</pubDate>
      <link>https://dev.to/devopsaitoolkit/how-devops-engineers-can-use-ai-to-triage-production-incidents-faster-3jb6</link>
      <guid>https://dev.to/devopsaitoolkit/how-devops-engineers-can-use-ai-to-triage-production-incidents-faster-3jb6</guid>
      <description>&lt;p&gt;The pager goes off at 02:14. Checkout latency is up, error rate is climbing, and you have three dashboards, a wall of logs, and a half-awake brain. The fix, once you know what's wrong, is usually fast. The expensive part is the triage — the first fifteen minutes of "what is actually broken, and what changed?"&lt;/p&gt;

&lt;p&gt;That triage window is exactly where AI helps most, and exactly where it's most dangerous if you let it run commands. This is how to use it to go faster without handing it the keys to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rule that makes AI safe during an incident
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI reads and reasons. Humans run commands.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During an active incident you are sleep-deprived and time-pressured — the worst possible state to paste a command you don't fully understand. So draw a hard line: AI is allowed to look at evidence and propose a plan. It is never allowed to execute anything. Every command it suggests goes through your eyes and your hands.&lt;/p&gt;

&lt;p&gt;In practice that means you treat the model like a very fast, very well-read junior SRE sitting next to you: it can summarize, correlate, hypothesize, and draft commands — but you're the one with the keyboard, and you read each command before it runs.&lt;/p&gt;

&lt;p&gt;If you only take one thing from this article, take that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Turn the firehose into a summary
&lt;/h2&gt;

&lt;p&gt;The first thing AI is genuinely great at is reading more text than you can at 2am. Paste in the raw material and ask for structure, not answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The firing alerts (name, severity, labels, duration)&lt;/li&gt;
&lt;li&gt;A representative slice of error logs&lt;/li&gt;
&lt;li&gt;Recent deploy / change history&lt;/li&gt;
&lt;li&gt;The relevant dashboard values (p99 latency, error rate, saturation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then prompt it deliberately:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Here are the alerts, logs, and recent changes for an active production incident. Summarize what's happening in 5 bullets, list the top 3 hypotheses ordered by likelihood, and for each hypothesis give me the single read-only command that would confirm or rule it out. Do not suggest any command that changes state."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That last sentence matters. Left unconstrained, models love to suggest &lt;code&gt;kubectl rollout restart&lt;/code&gt; as step one. You want the diagnostics first.&lt;/p&gt;
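
&lt;p&gt;If you would rather assemble that prompt from a script than by hand, a minimal Python sketch could look like the one below. The &lt;code&gt;build_triage_prompt&lt;/code&gt; function, its parameters, and the sample alert, log, and change lines are all hypothetical; only the wording of the prompt comes from the text above.&lt;/p&gt;

```python
READ_ONLY_RULE = "Do not suggest any command that changes state."


def build_triage_prompt(alerts, log_lines, changes, max_log_lines=50):
    """Assemble alerts, a log slice, and recent changes into one prompt."""
    sections = [
        "Here are the alerts, logs, and recent changes for an active "
        "production incident.",
        "ALERTS:\n" + "\n".join(alerts),
        # Keep only the most recent lines so the prompt stays focused.
        "LOGS:\n" + "\n".join(log_lines[-max_log_lines:]),
        "RECENT CHANGES:\n" + "\n".join(changes),
        "Summarize what's happening in 5 bullets, list the top 3 hypotheses "
        "ordered by likelihood, and for each hypothesis give the single "
        "read-only command that would confirm or rule it out.",
        # The constraint goes last so it is the final instruction the model reads.
        READ_ONLY_RULE,
    ]
    return "\n\n".join(sections)


# Hypothetical incident data for illustration.
prompt = build_triage_prompt(
    alerts=["CheckoutLatencyHigh severity=page for=5m"],
    log_lines=["02:10:01 ERROR upstream timeout", "02:10:04 ERROR upstream timeout"],
    changes=["02:05 config change: connection pool size"],
)
print(prompt)
```

&lt;p&gt;The script only builds text. You still paste the result into the model yourself, and you still read every command that comes back.&lt;/p&gt;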

&lt;h2&gt;
  
  
  Step 2: Make it order commands by blast radius
&lt;/h2&gt;

&lt;p&gt;A good incident AI prompt forces a risk classification on every suggested command. Ask it to label each one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;safe&lt;/strong&gt; — pure read-only: &lt;code&gt;kubectl get&lt;/code&gt;, &lt;code&gt;journalctl&lt;/code&gt;, &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;ip&lt;/code&gt;, &lt;code&gt;cat&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, &lt;code&gt;promtool query&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;caution&lt;/strong&gt; — shells in or makes a small change: &lt;code&gt;kubectl exec&lt;/code&gt;, &lt;code&gt;docker exec&lt;/code&gt;, editing non-prod config&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;destructive&lt;/strong&gt; — restarts, deletes, scale-to-zero, firewall changes, migrations, restores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then it must order them safest-first. You work top-down and you stop the moment you have a diagnosis. The number of incidents that get &lt;em&gt;worse&lt;/em&gt; because someone reached for a destructive "fix" before confirming the cause is depressingly high — a forced safest-first ordering is a cheap guardrail against that.&lt;/p&gt;
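
&lt;p&gt;The same ordering can be enforced on your side instead of trusting the model to sort correctly. This is a small Python sketch, assuming the model returns each command with one of the three labels above; the &lt;code&gt;order_safest_first&lt;/code&gt; helper and the example commands (including the &lt;code&gt;checkout&lt;/code&gt; names) are hypothetical.&lt;/p&gt;

```python
RISK_ORDER = {"safe": 0, "caution": 1, "destructive": 2}


def order_safest_first(suggestions):
    """Sort (risk, command) pairs so read-only commands come first.

    Unknown labels are treated as destructive, the most conservative choice.
    """
    def rank(item):
        risk, _command = item
        return RISK_ORDER.get(risk, RISK_ORDER["destructive"])

    # sorted() is stable, so the model's order is kept within each risk level.
    return sorted(suggestions, key=rank)


# Hypothetical model output, deliberately out of order.
plan = [
    ("destructive", "kubectl rollout restart deployment/checkout"),
    ("safe", "kubectl get pods -n checkout"),
    ("caution", "kubectl exec -it checkout-0 -- ss -tn"),
    ("safe", "journalctl -u checkout --since '15 min ago'"),
]

for risk, command in order_safest_first(plan):
    print(f"[{risk}] {command}")
```

&lt;p&gt;Nothing here executes a command. The output is a reading list for the human on the keyboard, who works from the top and stops at the first diagnosis.&lt;/p&gt;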

&lt;blockquote&gt;
&lt;p&gt;Tip: keep your standard incident prompt in a snippet manager or a prompt library so you're not authoring it at 2am. We keep a set of &lt;a href="https://dev.to/categories/incident-response/"&gt;AI incident-response prompts&lt;/a&gt; for exactly this.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 3: Correlate "what changed" automatically
&lt;/h2&gt;

&lt;p&gt;Most incidents are caused by a change. The model is good at lining up a timeline if you give it the raw inputs: the alert start time, the last few deploys, config changes, and infra events. Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The latency spike started at 02:09 UTC. Here is the deploy log and the config-change history for the last 6 hours. What changed closest to 02:09, and what's the mechanism by which it could cause this symptom?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where AI routinely beats a tired human: it doesn't get tunnel vision on the service you &lt;em&gt;think&lt;/em&gt; is the problem. It will notice the keepalived VIP change, the connection-pool tweak, or the cert that rotated — the boring change three layers down that you'd have found 20 minutes later.&lt;/p&gt;
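
&lt;p&gt;Before you paste the change history, it can help to trim it to the window that matters. The sketch below is one way to do that in Python, assuming each change is recorded with an ISO-8601 timestamp; the &lt;code&gt;changes_near&lt;/code&gt; function, the dates, and the change descriptions are hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def changes_near(incident_start, changes, window_hours=6):
    """Return changes made shortly before the incident, closest first.

    `changes` is a list of (ISO-8601 timestamp, description) pairs.
    """
    start = datetime.fromisoformat(incident_start)
    earliest = start - timedelta(hours=window_hours)
    candidates = []
    for stamp, description in changes:
        when = datetime.fromisoformat(stamp)
        # Only changes inside the window and before the spike can be causes.
        if start >= when >= earliest:
            candidates.append((start - when, stamp, description))
    # Smallest gap first: the change closest to the spike leads the list.
    candidates.sort()
    return [(stamp, description) for _gap, stamp, description in candidates]


# Hypothetical change history around a 02:09 latency spike.
history = [
    ("2026-06-07T18:30:00", "deploy checkout v142"),
    ("2026-06-08T01:58:00", "connection-pool size lowered"),
    ("2026-06-08T02:05:00", "keepalived VIP config change"),
    ("2026-06-08T02:30:00", "rollback started"),
]

for stamp, description in changes_near("2026-06-08T02:09:00", history):
    print(stamp, description)
```

&lt;p&gt;Proximity in time is only a hint, not proof of cause, so the filtered list is input for the "what's the mechanism" question rather than an answer to it.&lt;/p&gt;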

&lt;h2&gt;
  
  
  Step 4: Draft comms while you investigate
&lt;/h2&gt;

&lt;p&gt;Incident comms are a tax you pay in attention you don't have. Hand them to the model:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Write a status-page update for a degraded-checkout incident, customer-facing, no internal jargon, no root cause speculation, ~3 sentences. Then write a one-line internal update for the incident channel with current severity and what we're checking."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You get a customer update and an internal update in seconds, both in the right register. You skim, adjust a word, post. The investigation never stops to write prose.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Let it draft the postmortem from the timeline
&lt;/h2&gt;

&lt;p&gt;When the incident is resolved, the timeline is freshest and you're most likely to actually write it down. Paste the incident-channel scrollback and the command history and ask for a blameless postmortem draft: summary, timeline, root cause, impact, what went well, what to improve, and action items. You're editing a draft instead of facing a blank page — which is the difference between a postmortem that gets written and one that doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What NOT to do
&lt;/h2&gt;

&lt;p&gt;A few failure modes worth naming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't paste secrets.&lt;/strong&gt; Scrub tokens, passwords, internal hostnames, and customer data before anything goes into a model. Treat the prompt like a screenshot you might accidentally post in a public channel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't let it invent metrics.&lt;/strong&gt; If you ask for PromQL and you haven't given it your real metric names, it will confidently make them up. Give it your metric names or tell it to use clearly-marked placeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust a confident command.&lt;/strong&gt; "Confident" and "correct" are unrelated in language models. The safest-first ordering exists precisely so that a wrong-but-confident suggestion near the top of the list is read-only and harmless to run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't skip the human review for "obvious" fixes.&lt;/strong&gt; The obvious fix at 2am is how the incident gets a second act.&lt;/li&gt;
&lt;/ul&gt;
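
&lt;p&gt;For the first item, a scrubbing pass before anything leaves your terminal is cheap insurance. The following Python sketch shows the idea under loud assumptions: the &lt;code&gt;scrub&lt;/code&gt; function and its two patterns are hypothetical and deliberately narrow, and they do not cover internal hostnames or customer data, which need patterns specific to your environment.&lt;/p&gt;

```python
import re

# Illustrative patterns only; extend them for the secret formats you use.
PATTERNS = [
    # key=value or key: value pairs whose key looks sensitive
    re.compile(
        r"\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)\S+",
        re.IGNORECASE,
    ),
    # HTTP bearer tokens
    re.compile(r"\b(bearer)(\s+)[A-Za-z0-9._~+/=-]+", re.IGNORECASE),
]


def scrub(text):
    """Replace likely secret values with a fixed marker."""
    for pattern in PATTERNS:
        # Keep the key and separator, drop only the value.
        text = pattern.sub(r"\1\2[REDACTED]", text)
    return text


# Hypothetical log lines for illustration.
print(scrub("db password=hunter2 host=db1"))
print(scrub("Authorization: Bearer abc.def-123"))
```

&lt;p&gt;A regex pass like this catches the obvious cases and nothing more. Skim the scrubbed text yourself before it goes into a prompt.&lt;/p&gt;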

&lt;h2&gt;
  
  
  Where this fits in your workflow
&lt;/h2&gt;

&lt;p&gt;You don't need a platform to start — a saved prompt and a scratch buffer get you most of the value tonight. The structure is what matters: summarize the firehose, hypothesize with read-only confirmations, correlate the timeline, draft the comms, and let the human run every command.&lt;/p&gt;

&lt;p&gt;If you want the structured version of this flow — paste your symptoms and logs, get a risk-classified, safest-first plan plus a postmortem draft — that's exactly what we built the &lt;a href="https://dev.to/dashboard/incident-response/"&gt;AI Incident Response Assistant&lt;/a&gt; for. But the technique stands on its own. Steal the prompts, keep the human on the keyboard, and reclaim the first fifteen minutes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Generated incident plans and commands are assistive, not authoritative. Always verify recommendations against your own systems before running anything in production.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://devopsaitoolkit.com/blog/how-devops-engineers-can-use-ai-to-triage-production-incidents-faster/" rel="noopener noreferrer"&gt;DevOps AI ToolKit&lt;/a&gt; — practical AI workflows for cloud engineers.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>incidentresponse</category>
      <category>ai</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
