DevOps engineers work at the intersection of development speed and operational reliability, often managing CI/CD pipelines, cloud infrastructure, monitoring stacks, and on-call incidents at the same time. These 35 prompts help with writing automation scripts, drafting runbooks, reviewing infrastructure-as-code, and communicating technical decisions to non-technical stakeholders.
The prompts are copy-paste ready. Each one names specific tools, but they are easy to adapt to your own stack, whether you're running Kubernetes on AWS, Terraform on GCP, or managing a hybrid on-prem environment.
1. CI/CD Pipeline Design and Troubleshooting
I am building a GitHub Actions CI/CD pipeline for a Node.js microservice that deploys to AWS ECS Fargate. The pipeline should: run unit tests, run ESLint, build a Docker image, push to ECR, and deploy to staging automatically on merge to main. Production deploys require manual approval. Write a complete GitHub Actions YAML workflow file with best practices for secrets management and caching.
My Jenkins pipeline is intermittently failing at the Docker build step with the error "no space left on device" on the build agent. Walk me through the most likely root causes in order of probability and provide specific commands to diagnose and fix each one. Include how to prevent this from recurring.
Help me design a GitLab CI/CD pipeline for a Python Flask application with the following stages: lint (flake8 + black), unit test (pytest with coverage report), SAST (Semgrep), Docker build and push to GitLab Container Registry, deploy to Kubernetes via Helm on merge to main, and Slack notification on failure. Write the complete .gitlab-ci.yml.
I need to implement a blue-green deployment strategy for a critical API service running on AWS ECS. Walk me through the architecture, the specific AWS services involved (ALB, ECS, CodeDeploy), and write the key configuration files needed. Explain how traffic cutover works and how to roll back if the green environment fails health checks.
Our CI pipeline takes 45 minutes to complete for a monorepo with 12 services. Help me identify and implement optimization strategies. Cover: parallelization, Docker layer caching, test splitting, selective builds based on changed files, and artifact reuse between stages. Provide specific GitHub Actions configuration examples for each optimization.
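When reviewing the response to the prompt above, it helps to remember that selective builds come down to mapping changed paths to services. The sketch below is a minimal illustration, not a feature of any CI tool. It assumes a hypothetical monorepo layout of services/<name>/... plus a shared/ directory whose changes trigger a full rebuild; the directory names and the function name are made up for this example.

```python
from pathlib import PurePosixPath

# Hypothetical layout (an assumption for this sketch):
# services/<name>/... holds each service, shared/ affects every service.
SERVICES_DIR = "services"
SHARED_DIR = "shared"


def services_to_build(changed_files, all_services):
    """Return the sorted list of services affected by the changed paths."""
    affected = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if not parts:
            continue
        if parts[0] == SHARED_DIR:
            # A change to shared code invalidates every service.
            return sorted(all_services)
        if parts[0] == SERVICES_DIR and len(parts) >= 2 and parts[1] in all_services:
            affected.add(parts[1])
    # Files outside services/ and shared/ (docs, README) trigger nothing.
    return sorted(affected)
```

In a pipeline, the changed_files list would typically come from git diff --name-only against the base branch, and the result would decide which build jobs run.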
2. Infrastructure as Code (Terraform / CloudFormation)
Write a Terraform module for a production-ready AWS VPC with the following: 3 public subnets and 3 private subnets across 3 AZs, an internet gateway, NAT gateways in each AZ for high availability, route tables, and security group defaults. Follow Terraform best practices including variable definitions, outputs, and resource tagging.
I have the following Terraform code that is failing with the error "Error: Invalid for_each argument — The given 'for_each' argument value is unsuitable." [paste code here]. Explain why this error occurs and rewrite the relevant section to fix it, explaining the correct approach for dynamic resource creation with for_each when the count is not known at plan time.
Help me design a Terraform project structure for a multi-environment (dev, staging, prod) AWS infrastructure with shared modules. Include: recommended directory layout, how to handle environment-specific variables, remote state management with S3 and DynamoDB locking, and the tradeoffs between workspaces and separate state files.
Write a Terraform configuration for an AWS RDS PostgreSQL instance with the following requirements: Multi-AZ deployment, encrypted storage, automatic backups with 7-day retention, parameter group for performance tuning, security group restricting access to the application tier only, and Secrets Manager integration for credential rotation.
I am migrating existing AWS infrastructure to Terraform using terraform import. Walk me through the process step by step: how to import existing resources, how to write matching configuration, how to handle the state file, and what common pitfalls to watch for. Include examples for importing an EC2 instance and an S3 bucket.
3. Kubernetes and Container Orchestration
Write a production-grade Kubernetes deployment manifest for a Python FastAPI application. Include: Deployment with 3 replicas, resource requests and limits, liveness and readiness probes, rolling update strategy with maxSurge and maxUnavailable, a HorizontalPodAutoscaler targeting 60% CPU, a Service of type ClusterIP, and a ConfigMap for environment configuration.
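To sanity-check an autoscaler configuration returned for the prompt above, it helps to know the scaling rule the Kubernetes documentation gives for the HorizontalPodAutoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule to the 60% CPU target. The min_replicas and max_replicas defaults are assumptions for illustration, and the real controller also applies a tolerance band and stabilization behavior that this simplified version leaves out.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=60,
                     min_replicas=3, max_replicas=10):
    """Simplified HPA rule: scale in proportion to metric / target, then clamp.

    min_replicas and max_replicas are illustrative assumptions.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))
```

With 3 replicas averaging 90% CPU against a 60% target, the rule asks for ceil(4.5) = 5 replicas; at 30% CPU it would ask for 2, which the assumed minimum of 3 overrides.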
A pod in my Kubernetes cluster is stuck in CrashLoopBackOff. Walk me through a systematic debugging process: what kubectl commands to run in what order, how to interpret the output, the most common root causes for this error state, and how to fix each one.
Help me implement Kubernetes network policies to enforce a zero-trust security posture in a namespace called "payments" that contains 3 microservices: payment-api, fraud-detector, and audit-logger. Write the NetworkPolicy manifests that: deny all ingress and egress by default, allow only the specific inter-service communication paths needed, and allow egress to the cluster DNS.
I need to set up a complete observability stack on Kubernetes using the kube-prometheus-stack Helm chart. Walk me through: the Helm installation command with key values overrides, how to access Grafana, how to create a custom ServiceMonitor for my application, and how to configure an alerting rule for pod restarts exceeding a threshold. Include the YAML for the ServiceMonitor and PrometheusRule.
Explain the difference between Kubernetes StatefulSets and Deployments, when to use each, and walk me through a complete StatefulSet example for a 3-node Redis cluster with persistent storage, a headless service for stable network identities, and pod disruption budget configuration.
4. Monitoring, Alerting, and Observability
Help me design an alerting strategy for a production e-commerce platform on AWS. I need alerts for: API error rate above 1%, p99 latency above 500ms, pod CPU above 85% for 10 minutes, database connection pool exhaustion, and payment service 5xx errors. Write the alerting rules in Prometheus YAML format and suggest appropriate notification channels and escalation tiers for each.
Write a Grafana dashboard JSON configuration for monitoring a Node.js microservice. Include panels for: request rate (RPS), error rate percentage, p50/p95/p99 latency, active connections, memory usage, CPU usage, and pod restart count. Use the Prometheus datasource and include variable templates for namespace and deployment.
I need to implement distributed tracing for a microservices application. Compare OpenTelemetry, Jaeger, and Zipkin: what problem each solves, how they relate to each other, and write a Python FastAPI instrumentation example using the OpenTelemetry SDK that sends traces to a Jaeger backend.
Help me write a structured logging strategy for a Python microservices application. Cover: log levels and when to use each, the fields every log entry should include (correlation ID, service name, environment, etc.), using structlog for JSON output, how to propagate trace context into logs, and how to configure log aggregation with the ELK stack.
Write a runbook for investigating and resolving high database CPU utilization in a production PostgreSQL RDS instance. Include: initial diagnosis steps, key queries to identify blocking queries and slow queries, how to use pg_stat_statements, when to scale vs. optimize, and how to communicate status to stakeholders during the incident.
5. Security and Compliance
Review the following Dockerfile for security best practices and rewrite it with improvements: [paste Dockerfile]. Specifically check for: running as root, using latest tags, exposed sensitive credentials, unnecessary packages, and image size bloat. Explain each change you make.
Help me implement a secrets management strategy for a Kubernetes-based application. Compare: Kubernetes Secrets (native), AWS Secrets Manager with External Secrets Operator, HashiCorp Vault with the Vault Agent Injector, and Sealed Secrets. For each, explain the security model, operational complexity, and ideal use case. Recommend an approach for a team of 8 engineers managing 20 microservices.
Write a GitHub Actions workflow that integrates security scanning into the CI pipeline. Include: Trivy for container image vulnerability scanning, Checkov for Terraform IaC misconfiguration scanning, OWASP Dependency Check for known vulnerable dependencies, and Semgrep for SAST. Configure the pipeline to fail on critical/high findings and generate a SARIF report uploadable to GitHub Security.
I need to achieve SOC 2 Type II compliance for our AWS infrastructure. Draft an infrastructure security checklist covering the key technical controls across the five Trust Services Criteria: Security, Availability, Processing Integrity, Confidentiality, and Privacy. Focus on AWS-specific controls and name the specific service or configuration for each.
Help me write an IAM least-privilege policy for an AWS Lambda function that needs to: read from a specific S3 bucket, write to a specific DynamoDB table, publish to a specific SNS topic, and write CloudWatch Logs. Write the IAM policy JSON with resource-level restrictions and explain why each permission is included.
6. Incident Response and Postmortems
Help me run a structured incident response for a production outage. Our API gateway is returning 502 errors for 40% of requests. Write a step-by-step incident response guide covering: initial triage (first 15 minutes), communication templates for internal Slack and status page, systematic investigation steps, escalation criteria, and resolution verification checklist.
Write a postmortem document template following the blameless postmortem methodology for a 47-minute production database outage caused by a misconfigured Terraform change. Include sections for: incident summary, timeline, root cause analysis (5 Whys), contributing factors, impact assessment, what went well, what went poorly, and action items with owners and due dates.
I need to write an incident communication to customers about a 2-hour partial outage affecting file upload functionality for our SaaS platform. Approximately 15% of users were impacted. Write: a status page update (posted during the incident), a resolution update (posted after restoration), and a follow-up email to affected customers for the next day. Tone: transparent, accountable, and confidence-restoring.
Help me build an on-call runbook library structure for a team of 6 engineers managing a distributed system with 15 microservices, an API gateway, PostgreSQL, Redis, and AWS infrastructure. Define the runbook template format, what scenarios need runbooks, how to keep them current, and write a complete example runbook for the scenario: "Redis cache miss rate exceeds 50%."
Write an SLI/SLO/SLA framework for a B2B SaaS API product. Define: 3 appropriate SLIs with their Prometheus query expressions, SLO targets for each with justification, an error budget policy (what happens when the error budget is consumed), and draft SLA language for a customer contract with a 99.9% monthly uptime commitment.
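Before accepting SLA language from the prompt above, it is worth working out what a 99.9% monthly commitment allows in practice. The sketch below converts an uptime percentage into a downtime budget. It assumes a 30-day month as the measurement window, which a real contract would need to define explicitly; the function names are hypothetical.

```python
def downtime_budget_minutes(slo_percent, days=30):
    """Minutes of downtime allowed in the window for a given uptime target.

    The 30-day default is an assumption; calendar months vary in length.
    """
    window_minutes = days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining_minutes(slo_percent, downtime_so_far, days=30):
    """Budget left after observed downtime; a negative value means a breach."""
    return downtime_budget_minutes(slo_percent, days) - downtime_so_far
```

For 99.9% over 30 days this gives roughly 43.2 minutes, so a single 47-minute outage like the one in the postmortem prompt would already exceed the month's budget.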
7. Documentation and Team Communication
Write a technical architecture decision record (ADR) for the decision to migrate from a monolithic deployment to a Kubernetes-based microservices architecture. Include: status, context, decision, consequences (positive and negative), alternatives considered, and the decision-making criteria. Target audience: current and future engineers on the team.
Help me write a developer onboarding guide for the local development environment setup for a microservices application. The stack is: Docker Compose for local services, a React frontend, 3 Python FastAPI services, PostgreSQL, Redis, and a Kafka instance. Include prerequisites, step-by-step setup, common troubleshooting, and how to run the test suite.
I need to present a proposed migration from self-managed Kubernetes on EC2 to Amazon EKS to engineering leadership and the CFO. Write a 5-slide presentation outline covering: current pain points, proposed solution, cost analysis (total cost of ownership comparison), migration risk and mitigation, and a 90-day implementation timeline.
Write a capacity planning document for our infrastructure team. We are planning for 3x user growth over the next 12 months. Current infrastructure handles 10,000 requests/minute at 60% CPU. Cover: load testing methodology, scaling trigger thresholds, component-by-component capacity analysis (compute, database, cache, storage), cost projections, and recommended scaling actions.
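The numbers in the prompt above support a quick back-of-the-envelope check to compare against whatever the model returns. The sketch below assumes CPU utilization scales linearly with request rate, which real systems often violate and which load testing should confirm. The 70% target utilization in the usage note is also an assumption, not something stated in the prompt.

```python
def required_capacity_multiplier(current_cpu_pct, growth_factor, target_cpu_pct):
    """How many times the current compute capacity is needed after growth.

    Assumes CPU utilization grows linearly with load (a simplification).
    """
    return current_cpu_pct * growth_factor / target_cpu_pct


def max_rate_at_target(current_rate, current_cpu_pct, target_cpu_pct):
    """Request rate the current fleet could serve at the target utilization."""
    return current_rate * target_cpu_pct / current_cpu_pct
```

Under these assumptions, 3x growth from 60% CPU needs 3x the capacity to stay at 60%, or about 2.6x if the team is willing to run at 70%. The existing fleet would reach 70% at roughly 11,700 requests/minute.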
Draft a team process document for our DevOps team's on-call rotation. Include: rotation schedule logic (6 engineers, weekly rotation), expected response times by severity level, escalation paths, compensation/time-off policy for on-call burden, tooling requirements (PagerDuty, Slack), and a section on psychological safety and sustainable on-call practices.
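The "rotation schedule logic" in the prompt above can be as simple as counting weeks from a fixed start date. The sketch below is one minimal way to do it, assuming a fixed roster order and a handoff every 7 days. The engineer names and start date are placeholders, and it ignores swaps, holidays, and overrides that a tool like PagerDuty would handle.

```python
from datetime import date


def on_call_for(day, roster, rotation_start):
    """Return who is on call on `day` for a weekly round-robin rotation."""
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]


# Placeholder roster and start date, purely for illustration.
ROSTER = ["eng-a", "eng-b", "eng-c", "eng-d", "eng-e", "eng-f"]
ROTATION_START = date(2024, 1, 1)
```

With six engineers, each person is on call one week in six, and the cycle repeats every 42 days.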
Get the Complete DevOps Engineer AI Toolkit
Works with Claude, ChatGPT, and DeepSeek. Copy-paste ready.