James Joyner for DevOps AI ToolKit

Posted on Jun 26 • Originally published at devopsaitoolkit.com

DevOps Security Best Practices Every Engineering Team Should Follow

#devops #security #devsecops #hardening

I've spent 25 years securing Linux boxes, cloud accounts, CI/CD pipelines, and production clusters. The single most consistent lesson across all of it is this: the teams that get breached aren't the ones who lacked a security department. They're the ones who treated security as something a separate department would handle later.

Security is not a phase. It's not a gate at the end of the pipeline, and it's not a quarterly audit. It's a property of how you write infrastructure code, manage secrets, ship containers, and run production every single day. When security lives inside the daily workflow — in the merge request, the pipeline stage, the Terraform plan — it costs almost nothing. When it lives in a separate review at the end, it's expensive, late, and routinely skipped.

This is the checklist I'd hand a new engineering team. Everything here is defensive: hardening, detection, and recovery. Work through it section by section.

Why DevOps security belongs in the daily workflow

The whole premise of DevOps was to stop throwing work over the wall between dev and ops. Security is the last wall standing in most orgs, and it has to come down the same way: by moving the controls into the tools engineers already use.

Treat every pull/merge request as a security review surface, not just a code review.
Run security checks as pipeline stages that fail the build, not as advisory reports nobody reads.
Make the secure path the easy path — a hardened base image, a vetted Terraform module, a secrets helper — so engineers don't route around it.
Assign a security owner per service, not a security team for the whole company. Ownership beats oversight.
Measure mean-time-to-remediate for vulnerabilities the same way you measure deploy frequency.

If a control only exists in a wiki page, it doesn't exist. If it exists in the pipeline, it's real.

Secure access control and least privilege

Most incidents I've cleaned up came down to one over-privileged credential. Least privilege is boring and it's the highest-leverage thing on this list.

Default every IAM role, Kubernetes ServiceAccount, and Linux user to zero permissions, then add only what's needed.
Replace standing admin access with just-in-time elevation that expires automatically.
Scope cloud roles to specific resources and actions — no *:* policies, ever.
In Kubernetes, prefer namespaced Roles and RoleBindings over ClusterRoles and ClusterRoleBindings wherever possible.
Separate human identities from machine identities. Humans get SSO; services get workload identity.
Audit who can sudo, who's in the docker group (that's root-equivalent), and who holds cloud admin — quarterly, in writing.
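
As a rough illustration of the "no *:* policies" rule above, here is a minimal Python sketch that flags Allow statements granting every action on every resource in an IAM-style policy document. The function name and the example policy are hypothetical, and the check is deliberately narrow: it does not catch service-wide wildcards such as s3:*, NotAction tricks, or condition keys. Treat it as a starting point, not a replacement for a real policy analyzer.

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM allows either a single string or a list for Action and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant every action on every resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        if any(a in ("*", "*:*") for a in actions) and "*" in resources:
            flagged.append(stmt)
    return flagged


# Hypothetical policy: the first statement is the kind to reject in review.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
    ],
}

for stmt in find_wildcard_statements(example_policy):
    print("over-broad statement:", stmt)
```

Run against exported policies in a pipeline stage, a non-empty result would be a reasonable reason to fail the build.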

SSH key management and MFA

SSH is still how a huge amount of production gets touched, and it's still where credential hygiene quietly rots.

Disable password authentication and direct root login: PasswordAuthentication no and PermitRootLogin no in sshd_config.
Use per-user keys, never a shared key passed around in a chat thread.
Prefer short-lived SSH certificates from a CA over long-lived static keys; rotate the rest on a schedule.
Put a bastion/jump host in front of production and log every session through it.
Require MFA on every identity provider, VPN, and cloud console — phishing-resistant (WebAuthn/hardware keys) for anyone with production access.
Pull keys for departed team members the same day, and audit authorized_keys files for orphans.
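
The two sshd_config settings in the list above are easy to verify mechanically. The sketch below is one way to do it in Python, assuming the config text has already been read into a string; the function name is hypothetical. It only understands whitespace-separated "Keyword value" lines in the global section, stops at the first Match block, and does not follow Include directives. On a real host, the effective configuration printed by sshd -T is the more reliable input.

```python
# Settings this sketch insists on, keyed by lowercase keyword.
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def audit_sshd_config(text: str) -> dict[str, str]:
    """Return {keyword: problem} for each required setting that is missing or wrong."""
    seen: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # sshd keywords are case-insensitive
        if keyword == "match":
            break  # conditional Match blocks are out of scope for this sketch
        value = parts[1].strip().lower() if len(parts) > 1 else ""
        # sshd uses the first value it obtains for a keyword, so keep the first one.
        seen.setdefault(keyword, value)

    problems: dict[str, str] = {}
    for keyword, wanted in REQUIRED.items():
        actual = seen.get(keyword)
        if actual is None:
            problems[keyword] = "not set explicitly"
        elif actual != wanted:
            problems[keyword] = f"set to {actual!r}, expected {wanted!r}"
    return problems


sample = "PasswordAuthentication no\nPermitRootLogin yes\n"
print(audit_sshd_config(sample))
```

Flagging "not set explicitly" is a design choice: compiled-in defaults differ between OpenSSH versions and distributions, so the sketch prefers settings that are written down.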

Secrets management: API keys, passwords, and tokens

The fastest way to leak a secret is to commit it. The second fastest is to print it. Both are entirely preventable.

Never store secrets in git — not in code, not in .env, not in a "temporary" YAML file. Add a pre-commit secret scanner (gitleaks or trufflehog) to block it.
Centralize secrets in a real secrets manager: HashiCorp Vault, a cloud secrets manager, or equivalent.
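For Kubernetes, keep plain Secret manifests out of git: their values are only base64-encoded, not encrypted. Use Sealed Secrets to commit an encrypted form that only the in-cluster controller can decrypt, or an external-secrets operator so the cluster pulls values from Vault at runtime.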
Give every secret a rotation policy and an owner. Static credentials that never rotate are time bombs.
Inject secrets as runtime environment values or mounted files, not baked into container images or Terraform state.
Scan your git history, not just the current tree. A secret deleted in HEAD is still in the log, so treat it as compromised and rotate it.
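
To make the idea of a secret scanner concrete, here is a toy version in Python. It is not how gitleaks or trufflehog work internally, and its two patterns (AWS-style access key IDs and PEM private key headers) are only a small sample of what those tools detect. The names are hypothetical. The same function could be pointed at the text of a diff or at git log -p output to cover history as well as the current tree.

```python
import re

# A deliberately small rule set; real scanners ship hundreds of rules plus entropy checks.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = "user = 'alice'\nkey = 'AKIAIOSFODNN7EXAMPLE'\n"
for line_number, rule in scan_text(sample):
    print(f"line {line_number}: possible {rule}")
```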

CI/CD pipeline security

Your pipeline has credentials to everything. That makes it one of the highest-value targets you own, and it's frequently the least hardened.

Protect your main branches: require reviews, status checks, and signed commits before merge.
Mark CI/CD variables as protected and masked so they're only exposed on protected branches and never echoed to logs.
In GitLab CI, scope variables to environments and never echo a secret — masking helps, but the discipline of not printing it is what saves you.
Replace long-lived cloud keys in CI with short-lived credentials via OIDC. Let the pipeline exchange its identity for a temporary, scoped token instead of holding a static AWS_SECRET_ACCESS_KEY.
Pin and review your pipeline dependencies — third-party CI templates and actions run with your pipeline's privileges.
Require manual approval for production deploys, and make the deploy job itself least-privileged.

A leaked CI variable is a leaked production credential. Treat the pipeline config with the same care you'd treat root.

Container image scanning

A container is only as trustworthy as the layers underneath it. Most images ship with known CVEs the team never looked at.

Scan every image with Trivy (or Grype) as a GitLab pipeline stage before push, and fail the build on high/critical findings:

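  container_scan:
    stage: test
    image:
      # Pin a specific Trivy release or digest in real use; :latest is a floating tag.
      name: aquasec/trivy:latest
      # The image's entrypoint is the trivy binary; clear it so GitLab can run the script.
      entrypoint: [""]
    script:
      # Exit non-zero, failing the job, on any HIGH or CRITICAL finding.
      - trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE_TAG"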

Start from minimal base images (distroless or slim) to shrink the attack surface.
Run containers as a non-root user (USER in the Dockerfile) with a read-only root filesystem where possible.
Drop all Linux capabilities and add back only what's required.
Pin base images by digest, not by floating :latest tags, and rebuild regularly to pick up patches.
Sign images and verify signatures at admission so only your builds run in your cluster.
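
The "pin by digest" rule in the list above is simple enough to lint for. The Python sketch below reads Dockerfile text and reports FROM lines whose image reference does not end in an @sha256 digest. The function name is hypothetical, and the parsing is intentionally naive: it skips scratch and references to earlier build stages, but it does not expand ARG variables or handle line continuations.

```python
import re

# A digest-pinned reference ends with @sha256: followed by 64 hex characters.
DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_base_images(dockerfile_text: str) -> list[str]:
    """Return image references from FROM lines that are not pinned by digest."""
    unpinned = []
    stages: set[str] = set()  # names introduced by "FROM ... AS name"
    for raw in dockerfile_text.splitlines():
        parts = raw.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64 before reading the image.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_earlier_stage = image.lower() in stages
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2].lower())
        if image == "scratch" or is_earlier_stage:
            continue
        if not DIGEST.search(image):
            unpinned.append(image)
    return unpinned


print(unpinned_base_images("FROM python:3.12-slim AS build\nFROM build\n"))
```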

Infrastructure as Code security

IaC is where a one-line mistake becomes a fleet-wide misconfiguration. The good news: it's also where automated policy catches it before it ships.

Review Terraform and Ansible changes like application code — every change goes through a merge request with a human reviewer.
Run static analysis on IaC in the pipeline: tfsec/Checkov for Terraform, ansible-lint and kube-linter for the rest.
Adopt policy-as-code (OPA/Conftest or Sentinel) so rules like "no public S3 buckets" and "no 0.0.0.0/0 on port 22" are enforced automatically, not remembered by reviewers.
Protect and encrypt Terraform state, which can contain secrets in plaintext. Use a remote backend with locking and access controls.
For Ansible, encrypt sensitive variables with Ansible Vault and avoid become where it isn't needed.
Diff the plan before every apply and require approval for changes to security groups, IAM, and networking.
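
In practice a rule like "no 0.0.0.0/0 on port 22" would live in OPA/Conftest or Sentinel. The Python below is only a stand-in to show the shape of such a check, run against the JSON that terraform show -json produces for a saved plan. It assumes the AWS provider's aws_security_group resource with inline ingress blocks; the function name is hypothetical, and rules defined through separate rule resources or IPv6 ranges such as ::/0 are not covered.

```python
def open_ssh_ingress(plan: dict) -> list[str]:
    """Return addresses of aws_security_group resources that open port 22 to 0.0.0.0/0."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        # "after" is null for resources being destroyed, hence the fallbacks.
        after = (rc.get("change") or {}).get("after") or {}
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" not in (rule.get("cidr_blocks") or []):
                continue
            from_port, to_port = rule.get("from_port"), rule.get("to_port")
            all_traffic = str(rule.get("protocol")) == "-1"  # -1 means every protocol and port
            in_range = (
                from_port is not None
                and to_port is not None
                and from_port <= 22 <= to_port
            )
            if all_traffic or in_range:
                offenders.append(rc.get("address", "<unknown>"))
                break  # one finding per resource is enough
    return offenders


# Hypothetical, heavily trimmed plan document for illustration.
plan = {
    "resource_changes": [
        {
            "address": "aws_security_group.bastion",
            "type": "aws_security_group",
            "change": {
                "actions": ["create"],
                "after": {
                    "ingress": [
                        {"from_port": 22, "to_port": 22, "protocol": "tcp",
                         "cidr_blocks": ["0.0.0.0/0"]}
                    ]
                },
            },
        }
    ]
}
print(open_ssh_ingress(plan))
```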

If you want a structured second opinion on a risky module, an automated infrastructure code review catches the misconfigurations a tired reviewer skims past at the end of the day.

Patch management and vulnerability remediation

Unpatched systems are among the most common root causes of breaches, and the least glamorous to fix. Make patching routine so it isn't a decision.

Automate OS patching with unattended security updates on Linux, and rebuild container images on a cadence rather than letting them age.
Track your dependencies with SBOMs so you can answer "are we affected?" the day a CVE drops.
Subscribe to advisories for your stack and define an SLA: criticals patched in days, highs in a week or two.
Use Dependabot/Renovate to open dependency-bump PRs automatically and run them through your test suite.
Keep an inventory of every host, image, and cluster version — you can't patch what you don't know you run.
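
A remediation SLA only matters if something checks it. Below is a small Python sketch of that check, assuming findings arrive as dictionaries with an id, a severity, and a published date. The field names, the function name, and the specific day counts (3 for critical, 14 for high) are my assumptions for illustration; pick numbers that match your own SLA.

```python
from datetime import date, timedelta

# Hypothetical SLA in days per severity. Adjust to your own policy.
SLA_DAYS = {"CRITICAL": 3, "HIGH": 14}


def overdue_findings(findings: list[dict], today: date) -> list[str]:
    """Return IDs of findings whose remediation deadline has passed."""
    overdue = []
    for finding in findings:
        limit = SLA_DAYS.get(finding["severity"].upper())
        if limit is None:
            continue  # severities without an SLA are ignored in this sketch
        deadline = finding["published"] + timedelta(days=limit)
        if today > deadline:
            overdue.append(finding["id"])
    return overdue


# Placeholder findings, not real advisories.
findings = [
    {"id": "FINDING-1", "severity": "critical", "published": date(2024, 1, 1)},
    {"id": "FINDING-2", "severity": "high", "published": date(2024, 1, 1)},
]
print(overdue_findings(findings, today=date(2024, 1, 10)))
```

Feeding the count of overdue findings into the same dashboard as deploy frequency is one way to track mean-time-to-remediate alongside delivery metrics.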

Monitoring, logging, and alerting for security events

You cannot respond to what you can't see. Detection is the difference between a contained incident and a postmortem that starts with "we think they were in for three months."

Enable Linux auditd and ship authentication logs (/var/log/auth.log on Debian and Ubuntu, /var/log/secure on RHEL-family systems), sudo events, and SSH activity to centralized, append-only storage.
Export security metrics to Prometheus and build Grafana dashboards plus alerts for anomalies: failed-login spikes, new sudo grants, unexpected outbound connections, root logins, container escapes.
Alert on auditd/SSH anomalies in Grafana — a burst of failed SSH from a new ASN, or a successful root login outside business hours, should page someone.
Turn on cloud audit logging (CloudTrail or equivalent) and alert on IAM policy changes, new access keys, and security-group edits.
Capture Kubernetes audit logs and alert on exec into production pods and changes to RBAC.
Keep logs immutable and retained long enough to investigate a slow-burn intrusion.
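
To show what a "failed-login spike" rule boils down to, here is a minimal Python sketch that counts sshd "Failed password" lines per source address and returns the sources at or above a threshold. The function name and the default threshold of 5 are hypothetical. It ignores timestamps, so it counts over whatever lines it is given rather than over a sliding window; a production rule in your alerting stack would add the time dimension.

```python
import re
from collections import Counter

# Matches OpenSSH lines such as:
#   Failed password for invalid user admin from 203.0.113.5 port 40000 ssh2
FAILED = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def noisy_sources(log_lines, threshold: int = 5) -> dict[str, int]:
    """Return {source_address: count} for sources with at least `threshold` failures."""
    counts: Counter = Counter()
    for line in log_lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return {source: n for source, n in counts.items() if n >= threshold}


# Addresses below come from the documentation-only ranges.
sample = [
    "Jan  1 00:00:01 host sshd[100]: Failed password for root from 203.0.113.5 port 40001 ssh2",
    "Jan  1 00:00:02 host sshd[100]: Failed password for invalid user admin from 203.0.113.5 port 40002 ssh2",
    "Jan  1 00:00:03 host sshd[101]: Accepted publickey for alice from 192.0.2.1 port 50001 ssh2",
]
print(noisy_sources(sample, threshold=2))
```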

Backup and disaster recovery planning

Ransomware and fat-fingered terraform destroy have the same fix: backups you've actually tested. Untested backups are just hope.

Follow 3-2-1: three copies, two media types, one off-site.
Keep at least one immutable/air-gapped copy that a compromised admin credential can't delete.
Encrypt backups at rest and control who can read and restore them.
Test restores on a schedule. A backup you've never restored is a guess, not a recovery plan.
Document RPO and RTO per service and verify your backup cadence actually meets them.
Back up the things people forget: Terraform state, Vault data, etcd, and database credentials.
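
Verifying that backup cadence meets the documented RPO can be as simple as comparing the age of the newest backup against the target. The Python sketch below assumes you can already list the latest successful backup time per service; the function name, service names, and RPO values are hypothetical. It says nothing about whether a backup restores cleanly, which still needs a real restore test.

```python
from datetime import datetime, timedelta


def rpo_violations(
    last_backup: dict[str, datetime],
    rpo: dict[str, timedelta],
    now: datetime,
) -> list[str]:
    """Return services whose newest backup is missing or older than their RPO."""
    late = []
    for service, limit in rpo.items():
        newest = last_backup.get(service)
        if newest is None or now - newest > limit:
            late.append(service)
    return sorted(late)


# Hypothetical targets and backup times.
now = datetime(2024, 1, 2, 12, 0)
rpo = {
    "db": timedelta(hours=1),
    "etcd": timedelta(hours=24),
    "vault": timedelta(hours=6),
}
last_backup = {
    "db": datetime(2024, 1, 2, 9, 0),    # three hours old, RPO is one hour
    "etcd": datetime(2024, 1, 2, 0, 0),  # twelve hours old, within RPO
    # no entry for vault: treated as a violation
}
print(rpo_violations(last_backup, rpo, now))
```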

Incident response preparation

The middle of an incident is the worst time to figure out your incident process. Prepare while it's calm.

Write a runbook: who's on call, how to declare an incident, where to communicate, and how to reach the right people fast.
Pre-define severity levels and the actions each triggers.
Keep break-glass credentials in a sealed, audited path — available in a crisis, logged when used.
Practice. Run a tabletop or game day at least quarterly so the steps are muscle memory.
Have communication templates ready — customer-facing and internal — so comms don't stall the investigation.
Draft blameless postmortems while the timeline is fresh, and turn action items into tracked work.

AI-assisted security checks — review, never blind trust

AI is genuinely useful for security work: it reads more YAML, Terraform, and logs than you can, and it's fast at spotting the misconfiguration buried in a 400-line diff. The rule is the same one I apply to everything an AI generates — it drafts, you decide.

Use AI to review IaC and pipeline configs for missing controls, over-broad permissions, and risky defaults — as a first-pass reviewer, not the final word.
Have it summarize scanner output and rank findings by real-world exploitability so your team fixes what matters first.
Let it draft hardening changes and policy rules, then read every line before you apply it — confident and correct are not the same thing.
Never paste live secrets, real hostnames, or customer data into a model. Scrub first.
Keep a human approving every change that touches IAM, networking, or production. AI accelerates the work; it doesn't own the risk.
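
"Scrub first" is easier to follow when there is a helper to do it. The Python sketch below redacts a few obvious patterns (AWS-style access key IDs, IPv4 addresses, and values assigned to names containing password, token, or secret) before text goes anywhere near a model. The function name and placeholder strings are hypothetical, and the pattern list is far from complete: hostnames, customer identifiers, and most token formats will pass straight through, so a human read of the scrubbed text is still required.

```python
import re

# Applied in order. Each entry is (pattern, replacement).
REDACTIONS = [
    (re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY_ID]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    # Keeps the variable name and separator, replaces only the value.
    (
        re.compile(r"(?i)(\w*(?:password|token|secret)\w*)(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
]


def scrub(text: str) -> str:
    """Return text with the known sensitive patterns replaced by placeholders."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


print(scrub("db_password = hunter2 on 10.0.0.5"))
```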

If you want vetted starting points, our security & hardening prompts cover image scanning, Linux hardening, and IaC review, and the broader prompt library has the rest.

The bottom line

None of these practices are exotic. Least privilege, managed secrets, scanned images, policy-checked infrastructure, patched hosts, real monitoring, tested backups, and a rehearsed incident plan — every one is achievable this quarter, and every one is cheaper to build into the workflow than to bolt on after a breach.

And here's the part that doesn't show up on the engineering scorecard: secure DevOps is a competitive advantage. Customers run security questionnaires before they sign. Investors ask about your posture in diligence. Partners won't integrate with a platform they don't trust. The companies that win those deals are the ones who can show, not claim, that protecting customer systems is built into how they work every day.

Security isn't the tax you pay to ship. It's part of why people let you ship to them at all.

