<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevOps Start</title>
    <description>The latest articles on DEV Community by DevOps Start (@devopsstart).</description>
    <link>https://dev.to/devopsstart</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862044%2F9672d1b5-f8fd-4473-998f-30a47c07608f.png</url>
      <title>DEV Community: DevOps Start</title>
      <link>https://dev.to/devopsstart</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devopsstart"/>
    <language>en</language>
    <item>
      <title>Secure Terraform PRs with an Architecture Firewall</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:15:37 +0000</pubDate>
      <link>https://dev.to/devopsstart/secure-terraform-prs-with-an-architecture-firewall-2e4f</link>
      <guid>https://dev.to/devopsstart/secure-terraform-prs-with-an-architecture-firewall-2e4f</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop the 'merge and pray' workflow! This guide was originally published on devopsstart.com and explores how to implement an automated architecture firewall for your Terraform PRs using OPA.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;An architecture firewall is a governance layer integrated into your CI/CD pipeline that automatically blocks infrastructure changes violating security or organizational standards before they reach your environment. Unlike a network firewall that filters packets, this firewall filters Pull Requests (PRs). It transforms your infrastructure requirements from passive documentation in a wiki into active, executable code that cannot be ignored.&lt;/p&gt;

&lt;p&gt;In this article, you will learn how to move beyond the "merge and pray" workflow by implementing Policy as Code (PaC). We will explore the technical bridge between a &lt;code&gt;terraform plan&lt;/code&gt; and automated validation using tools like Open Policy Agent (OPA) and Checkov. You'll discover how to create a pipeline that converts Terraform plans to JSON, evaluates them against strict guardrails and provides immediate feedback to developers via PR comments. By the end, you will have a strategy to enforce encryption, restrict public access and prevent accidental resource deletion without slowing down your engineering velocity. This approach aligns with modern &lt;a href="https://dev.to/blog/terraform-testing-best-practices-beyond-plan-and-pray"&gt;Terraform testing best practices&lt;/a&gt;, ensuring that your cloud footprint remains secure by design rather than by chance.&lt;/p&gt;

&lt;h2&gt;Why Manual PR Reviews Fail the Architecture Test&lt;/h2&gt;

&lt;p&gt;Relying solely on human peer reviews to catch security holes is a recipe for a production outage. In high-velocity environments, reviewers suffer from fatigue. When a developer submits a PR with 500 lines of HCL, a reviewer might miss a single &lt;code&gt;0.0.0.0/0&lt;/code&gt; in a security group or a missing &lt;code&gt;server_side_encryption_configuration&lt;/code&gt; block on an S3 bucket. Humans are great at reviewing logic and intent, but they are terrible at consistently auditing thousands of lines of configuration against a 50-page security compliance PDF.&lt;/p&gt;

&lt;p&gt;The "Merge and Pray" workflow creates a dangerous gap where "architectural drift" occurs. This happens when the actual state of your cloud deviates from your intended security posture because a few "small" exceptions were merged over time. To solve this, you need an automated gate that operates on the &lt;code&gt;terraform plan&lt;/code&gt; output. This plan is the only source of truth because it represents exactly what Terraform intends to do, accounting for variables, modules and the current state of the cloud.&lt;/p&gt;

&lt;p&gt;For example, any modern Terraform release can generate a machine-readable plan for your architecture firewall to analyze (the JSON plan output has been available since v0.12). This removes the ambiguity of reviewing the &lt;code&gt;.tf&lt;/code&gt; files alone, which don't show the final resolved values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate the binary plan file&lt;/span&gt;
terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan

&lt;span class="c"&gt;# Convert the binary plan to JSON for policy evaluation&lt;/span&gt;
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; tfplan &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tfplan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By shifting the audit from the code to the plan, you ensure that the firewall sees the final result, not just the intent. This is the foundation of a robust governance strategy.&lt;/p&gt;
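
&lt;p&gt;To see what the firewall will actually evaluate, you can inspect the resolved values in the plan JSON yourself. A minimal sketch, assuming &lt;code&gt;jq&lt;/code&gt; is available on the runner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List each planned change with its address and resolved actions
jq -r '.resource_changes[] | "\(.address): \(.change.actions | join(","))"' tfplan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;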

&lt;h2&gt;Implementing the Policy as Code Engine&lt;/h2&gt;

&lt;p&gt;To build an architecture firewall, you must choose a Policy as Code (PaC) engine. For simple, industry-standard checks, tools like Checkov or TFLint are excellent because they come with hundreds of pre-built policies. However, for complex organizational logic (such as "Production databases must be deployed in three availability zones and have a specific naming convention"), you need a general-purpose policy engine like Open Policy Agent (OPA). OPA uses a language called Rego to query JSON data.&lt;/p&gt;

&lt;p&gt;The technical flow is straightforward: your CI pipeline runs the plan, converts it to JSON and pipes that JSON into OPA. If the Rego policy returns a "deny" result, the CI pipeline fails and the PR is blocked from merging. This turns your security requirements into a unit test for your infrastructure.&lt;/p&gt;

&lt;p&gt;Below is a practical example of a Rego policy that prevents any AWS S3 bucket from being created without server-side encryption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;

&lt;span class="ow"&gt;import&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;if&lt;/span&gt;

&lt;span class="c1"&gt;# Default allow unless a violation is found&lt;/span&gt;
&lt;span class="ow"&gt;default&lt;/span&gt; &lt;span class="n"&gt;allow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c1"&gt;# Violation: S3 bucket without encryption&lt;/span&gt;
&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket"&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if server_side_encryption_configuration is missing or empty&lt;/span&gt;
    &lt;span class="c1"&gt;# In Terraform JSON, 'after' contains the planned state&lt;/span&gt;
    &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;after&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;server_side_encryption_configuration&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Security Violation: S3 bucket %s must have encryption enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this against your plan in a GitHub Action or GitLab CI runner using OPA v0.60.0, you would execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run OPA evaluation and capture the deny rules&lt;/span&gt;
opa &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; tfplan.json &lt;span class="nt"&gt;-d&lt;/span&gt; policy.rego &lt;span class="s2"&gt;"data.terraform.analysis.deny"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the output contains any messages, the firewall has triggered and the build should fail. Applied consistently in CI, this guarantees that no unencrypted bucket defined in Terraform reaches production through your pipeline.&lt;/p&gt;
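
&lt;p&gt;One caveat: &lt;code&gt;opa eval&lt;/code&gt; exits 0 even when the deny set is non-empty, so the gate needs an explicit failure condition. A minimal sketch using the &lt;code&gt;--fail-defined&lt;/code&gt; flag, which returns a non-zero exit code whenever the query produces a result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Querying with a variable means an empty deny set yields no result (exit 0),
# while any violation makes --fail-defined fail the CI job
opa eval --fail-defined -i tfplan.json -d policy.rego "data.terraform.analysis.deny[msg]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;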

&lt;h2&gt;Real-World Application: Preventing Production Catastrophes&lt;/h2&gt;

&lt;p&gt;A common production nightmare is the accidental deletion of a critical resource, such as a primary database or a core VPC, due to a renaming error or a module refactor. A manual reviewer might not realize that changing a resource name in Terraform results in a "destroy and recreate" action. An architecture firewall can catch this by analyzing the &lt;code&gt;actions&lt;/code&gt; array in the &lt;code&gt;terraform plan&lt;/code&gt; JSON.&lt;/p&gt;

&lt;p&gt;By writing a policy that flags any &lt;code&gt;delete&lt;/code&gt; action on resources tagged as &lt;code&gt;critical&lt;/code&gt;, you create a safety net. This doesn't mean you can never delete things; it means you must explicitly acknowledge the risk, perhaps through a "break-glass" label on the PR or a manual override from a Lead Architect.&lt;/p&gt;

&lt;p&gt;Consider this scenario: a developer changes the name of an RDS instance to match a new naming convention. Terraform sees this as deleting the old DB and creating a new one. Without a firewall, the PR looks like a simple string change. With a firewall, the system sees a &lt;code&gt;delete&lt;/code&gt; action on an &lt;code&gt;aws_db_instance&lt;/code&gt; and blocks it.&lt;/p&gt;
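
&lt;p&gt;You can confirm this behavior before writing any policy. A quick sketch, assuming &lt;code&gt;jq&lt;/code&gt;, that lists every resource the plan intends to destroy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Any address printed here will be destroyed on apply, including delete-and-recreate renames
jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address' tfplan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;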

&lt;p&gt;Here is how you would implement a "Protection" rule in Rego to block deletions of production databases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_changes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if the action includes 'delete'&lt;/span&gt;
    &lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;change&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"delete"&lt;/span&gt;

    &lt;span class="c1"&gt;# Only apply this to production environments by checking input variables&lt;/span&gt;
    &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;variables&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;

    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"CRITICAL FAILURE: Attempting to delete production database %s. This action is blocked by the Architecture Firewall."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Integrating this into your workflow requires a tight loop. You can use &lt;a href="https://dev.to/tutorials/how-to-automate-terraform-reviews-with-github-actions"&gt;automation for Terraform reviews&lt;/a&gt; to post these specific error messages directly as comments on the offending line of the PR. This transforms the "No" from the security team into a helpful, automated suggestion from the platform.&lt;/p&gt;
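
&lt;p&gt;As a rough sketch of that feedback loop, a CI step can capture the OPA findings and post them back to the PR. This assumes an authenticated GitHub CLI and a &lt;code&gt;PR_NUMBER&lt;/code&gt; variable supplied by your pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# --fail-defined exits non-zero if any deny message is produced
if ! opa eval --fail-defined -i tfplan.json -d policy.rego "data.terraform.analysis.deny[msg]" &amp;gt; findings.json; then
  # Surface the findings on the PR, then fail the build
  gh pr comment "$PR_NUMBER" --body-file findings.json
  exit 1
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;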

&lt;h2&gt;Best Practices for Architecture Guardrails&lt;/h2&gt;

&lt;p&gt;Implementing a firewall can create friction if handled poorly. If every PR is blocked by 50 different warnings, developers will find ways to bypass the system. Use these strategies to balance security with velocity.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distinguish Between Warnings and Failures&lt;/strong&gt;. Not every policy should block a merge. Use "Advisory" levels for things like "missing cost-center tag" (warning) and "Critical" levels for "open SSH port" (hard fail). This prevents the firewall from becoming a nuisance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version Your Policies&lt;/strong&gt;. Treat your Rego or Checkov policies like application code. Store them in a separate Git repository, version them and test them against a suite of "known-bad" Terraform plans to ensure no regressions in your security posture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Provide Remediation Guidance&lt;/strong&gt;. A failure message like &lt;code&gt;Policy violation: SEC-01&lt;/code&gt; is useless. Your firewall should return &lt;code&gt;Security Violation: Port 22 is open to 0.0.0.0/0. Please restrict this to the corporate VPN range (10.x.x.x)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement an Exception Process&lt;/strong&gt;. There will always be a legitimate reason to break a rule. Create a standardized way to grant exceptions, such as requiring a specific metadata tag (&lt;code&gt;exception_id = "SEC-123"&lt;/code&gt;) that the policy engine is programmed to ignore.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shift Left with Local Pre-commit Hooks&lt;/strong&gt;. Don't make the CI pipeline the first time a developer sees a failure. Provide a &lt;code&gt;pre-commit&lt;/code&gt; configuration using tools like &lt;code&gt;terraform-docs&lt;/code&gt; and &lt;code&gt;checkov&lt;/code&gt; so they can catch errors on their local machine, as sketched after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
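
&lt;p&gt;To make the fifth practice concrete, here is a minimal local sketch of the same gate, assuming Checkov is installed (for example via &lt;code&gt;pip install checkov&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Run the same checks locally before pushing so CI only sees clean code
terraform plan -out=tfplan
terraform show -json tfplan &amp;gt; tfplan.json
checkov -f tfplan.json --framework terraform_plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;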

&lt;h2&gt;FAQ&lt;/h2&gt;

&lt;h3&gt;Does the architecture firewall replace the need for manual peer reviews?&lt;/h3&gt;

&lt;p&gt;No, it augments them. The firewall handles the "objective" checks (security, compliance, syntax) so that human reviewers can focus on the "subjective" checks (architecture design, business logic and efficiency). It removes the tedious parts of the review process, allowing engineers to have higher-level discussions about the implementation.&lt;/p&gt;

&lt;h3&gt;Which tool should I choose: OPA, Checkov, or Sentinel?&lt;/h3&gt;

&lt;p&gt;If you are using Terraform Cloud/Enterprise, Sentinel is the native choice and offers the deepest integration. If you need a free, industry-standard scanner that works out-of-the-box with minimal configuration, go with Checkov. If you have complex, custom business logic that spans multiple cloud providers and requires a powerful query language, Open Policy Agent (OPA) is the gold standard. I have seen mature platform teams use Checkov for general security and OPA for custom organizational guardrails.&lt;/p&gt;

&lt;h3&gt;How do I prevent the firewall from slowing down my deployment pipeline?&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;terraform plan&lt;/code&gt; and &lt;code&gt;opa eval&lt;/code&gt; typically adds less than 60 seconds to a pipeline. To further optimize, you can run these checks in parallel with other tests. Additionally, by implementing local pre-commit hooks, you reduce the number of failed CI runs, meaning the pipeline only handles "clean" code.&lt;/p&gt;

&lt;h3&gt;Can I use this to manage costs?&lt;/h3&gt;

&lt;p&gt;Yes, this is a powerful use case. You can write policies that analyze the &lt;code&gt;resource_changes&lt;/code&gt; for expensive instance types. For example, you can block any PR that attempts to spin up an &lt;code&gt;aws_instance&lt;/code&gt; of type &lt;code&gt;p4d.24xlarge&lt;/code&gt; unless the project has a specific "high-compute" approval tag. This prevents "bill shock" by catching expensive mistakes before the resources are actually provisioned.&lt;/p&gt;
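
&lt;p&gt;Before codifying that rule, you can spot-check a plan for expensive instance types by hand. A sketch assuming &lt;code&gt;jq&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List every planned EC2 instance alongside its resolved instance type
jq -r '.resource_changes[] | select(.type == "aws_instance") | "\(.address): \(.change.after.instance_type)"' tfplan.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;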

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Building an architecture firewall is about shifting your mindset from "trusting the reviewer" to "trusting the system." By implementing Policy as Code, you ensure that your security standards are consistently applied across every single PR, regardless of who is reviewing it. This creates a scalable governance model that allows your platform team to support hundreds of developers without becoming a bottleneck.&lt;/p&gt;

&lt;p&gt;To get started, don't try to automate your entire security handbook at once. Start with the "low-hanging fruit": block public S3 buckets and open SSH ports. Once your team is comfortable with the automated feedback loop, gradually introduce more complex architectural rules.&lt;/p&gt;

&lt;p&gt;Your next steps are to install OPA or Checkov, integrate a &lt;code&gt;terraform show -json&lt;/code&gt; step into your GitHub Actions or GitLab CI and write your first "deny" rule. This transition from manual oversight to automated guardrails is the defining characteristic of a mature Platform Engineering organization.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>policyascode</category>
      <category>openpolicyagent</category>
      <category>devsecops</category>
    </item>
    <item>
      <title>Local LLM for Log Analysis: Privacy-First Debugging with Ollama</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:05:27 +0000</pubDate>
      <link>https://dev.to/devopsstart/local-llm-for-log-analysis-privacy-first-debugging-with-ollama-361o</link>
      <guid>https://dev.to/devopsstart/local-llm-for-log-analysis-privacy-first-debugging-with-ollama-361o</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop sending sensitive production logs to the cloud. This guide, originally published on devopsstart.com, shows you how to build a privacy-first debugging stack using Ollama and Llama 3.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Sending production logs to cloud AI APIs is a non-starter for any serious SRE in a regulated industry. The answer to maintaining security while gaining AI capabilities is to shift the inference engine to your own hardware. By deploying a local LLM stack using Ollama and Llama 3, you can perform semantic log analysis and root cause diagnosis without a single byte of data leaving your secure perimeter.&lt;/p&gt;

&lt;p&gt;Whether you are in fintech, healthcare, or govtech, the "Compliance Wall" is real. You cannot risk leaking PII, session tokens, or internal IP addresses to a third party, even with "Enterprise" privacy agreements. You can find the fundamental concepts of managing these workloads in the official &lt;a href="https://ollama.com/library" rel="noopener noreferrer"&gt;Ollama documentation&lt;/a&gt;, which provides the framework for running open-source models locally.&lt;/p&gt;

&lt;h2&gt;Why this take&lt;/h2&gt;

&lt;p&gt;Most organizations try to solve the privacy problem with PII Redaction scripts before sending logs to a cloud provider. This is a flawed strategy. Regular expressions and basic NER (Named Entity Recognition) models always miss something. A leaked credit card number or a proprietary internal URL in a stack trace can trigger a compliance audit that costs your company millions. The only way to guarantee zero leakage is to ensure the data never leaves the air-gapped environment or the VPC.&lt;/p&gt;

&lt;p&gt;In a production environment with over 500 microservices, the sheer volume of logs makes manual grepping impossible. I have seen teams spend six hours correlating logs across three different namespaces just to find a single timeout. A local LLM, when fed a curated slice of logs, can identify the behavioral pattern of a failure in seconds. For example, a sequence of 200 OK responses arriving in an impossible order often indicates a logic bug that regex-based monitors will never catch.&lt;/p&gt;

&lt;p&gt;Consider the operational reality of a CrashLoopBackOff. Instead of manually running &lt;code&gt;kubectl logs&lt;/code&gt; and &lt;code&gt;kubectl describe&lt;/code&gt; and trying to map them in your head, you can pipe the output directly into a local model. When you are &lt;a href="https://dev.to/blog/how-to-fix-kubernetes-crashloopbackoff-in-production"&gt;Fixing Kubernetes CrashLoopBackOff in Production&lt;/a&gt;, the bottleneck is usually the cognitive load of parsing verbose Java or Go stack traces. A local LLM reduces this load by summarizing the failure point immediately.&lt;/p&gt;

&lt;p&gt;The cost of cloud tokens for log analysis is astronomical. Logs are verbose. If you send 10MB of logs to a cloud LLM for every incident, your monthly bill will skyrocket. Running a 7B or 8B parameter model on a dedicated GPU node costs nothing but the electricity and the initial hardware investment.&lt;/p&gt;

&lt;h2&gt;The strongest counter-argument&lt;/h2&gt;

&lt;p&gt;The most common pushback against local LLMs is the "Hardware Tax." Critics argue that the VRAM requirements for acceptable performance are too high for a standard developer laptop or a typical DevOps jump box. It is true that running a 70B parameter model requires multiple A100s or H100s to be performant, which is an unreasonable ask for a local debugging setup. If you try to run a large model on a CPU with 16GB of RAM, the tokens per second will be so slow that you might as well go back to using &lt;code&gt;grep&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There is also the issue of Context Window limitations. A production log file can be several gigabytes, while most local models have a context window ranging from 8k to 128k tokens. You cannot simply upload a log file to Ollama and ask what happened. You have to implement a pre-processing pipeline to slice the logs, filter out the noise, and feed the model only the relevant window surrounding the timestamp of the error. This adds architectural complexity that a simple API call to OpenAI does not have.&lt;/p&gt;
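
&lt;p&gt;That pre-processing step is simpler than it sounds. A minimal sketch, where the error pattern and file name are placeholders for your own logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Keep only a window of context around the failure so the slice fits the model's context window
grep -C 200 "OutOfMemoryError" app.log &amp;gt; slice.log

# Sanity-check the slice size before handing it to the model
wc -l slice.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;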

&lt;p&gt;However, these arguments ignore the reality of model quantization. Using 4-bit quantization (GGUF format), you can run a Llama 3 8B model on a machine with as little as 8GB of VRAM with negligible loss in reasoning capability for log analysis. For DevOps tasks, you do not need the creative writing abilities of a 175B parameter model; you need a model that understands stack traces and Kubernetes events.&lt;/p&gt;

&lt;h2&gt;Exceptions where cloud LLMs still win&lt;/h2&gt;

&lt;p&gt;There are specific scenarios where a local LLM is the wrong tool. If you are a tiny startup with zero regulatory constraints and no dedicated hardware, the overhead of managing an Ollama instance is a distraction. In those cases, the speed of onboarding a cloud API outweighs the privacy risks.&lt;/p&gt;

&lt;p&gt;Cloud LLMs also win when you need cross-domain knowledge at an extreme scale. If your log error is caused by a very obscure bug in a niche third party library that was updated two weeks ago, a cloud model trained on the most recent web crawl might have the answer. A local model's knowledge is frozen at the time of its training.&lt;/p&gt;

&lt;p&gt;Additionally, if your team requires a collaborative, multi-user interface with complex permissioning and auditing for every single prompt, building that on top of Open WebUI requires more effort than using a managed SaaS platform. For the senior SRE who needs to diagnose a production outage in a secure environment, these advantages are irrelevant.&lt;/p&gt;

&lt;h2&gt;Implementing the Privacy-First Stack&lt;/h2&gt;

&lt;p&gt;To move from theory to production, use Ollama for the backend, Llama 3 (8B) for the reasoning, and Open WebUI for the interface.&lt;/p&gt;

&lt;h3&gt;Installing the Engine&lt;/h3&gt;

&lt;p&gt;On a Linux workstation with an NVIDIA GPU, install Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, pull the Llama 3 model; &lt;code&gt;ollama run&lt;/code&gt; downloads it on first use and then drops you into an interactive session. I recommend the 8B version for most log tasks as it balances speed and accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run llama3:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;The Log Pipeline Architecture&lt;/h3&gt;

&lt;p&gt;You cannot dump a 1GB log file into the model. You must use a pipeline. The most effective flow is: &lt;code&gt;Log Source&lt;/code&gt; → &lt;code&gt;Grep/Awk Filter&lt;/code&gt; → &lt;code&gt;Local LLM&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example, if you are debugging an OOMKilled pod, first extract the relevant events. If you have already followed the steps to &lt;a href="https://dev.to/troubleshooting/how-to-debug-oomkilled-pods-in-kubernetes-a-step-by-step-gui"&gt;Debug OOMKilled Pods in Kubernetes&lt;/a&gt;, you know that the &lt;code&gt;describe&lt;/code&gt; output is more valuable than the application logs.&lt;/p&gt;

&lt;p&gt;Use this bash script to automate the extraction and analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Extract the last 100 lines of logs and the pod description&lt;/span&gt;
kubectl describe pod my-app-6f7d8-abc &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pod_desc.txt
kubectl logs my-app-6f7d8-abc &lt;span class="nt"&gt;--tail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pod_logs.txt

&lt;span class="c"&gt;# Combine them into a prompt file&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Act as a Senior SRE. Analyze the following Kubernetes pod description and logs to find the root cause of the failure. Focus on memory limits and exit codes."&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; prompt.txt
&lt;span class="nb"&gt;cat &lt;/span&gt;pod_desc.txt &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prompt.txt
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- LOGS ---"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prompt.txt
&lt;span class="nb"&gt;cat &lt;/span&gt;pod_logs.txt &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; prompt.txt

&lt;span class="c"&gt;# Pipe the prompt to Ollama&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;prompt.txt | ollama run llama3:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Prompt Engineering for DevOps&lt;/h3&gt;

&lt;p&gt;Generic prompts yield generic answers. To get production-ready insights, give the LLM a persona and a specific constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad Prompt:&lt;/strong&gt; "What is wrong with these logs?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good Prompt:&lt;/strong&gt;&lt;br&gt;
"Act as a Site Reliability Engineer specializing in Java Spring Boot applications. I am providing a heap dump summary and the last 50 lines of the application log. Identify if this is a Memory Leak or a sudden spike in traffic. Provide the answer in a bulleted list: 1. Root Cause, 2. Evidence from logs, 3. Recommended fix."&lt;/p&gt;

&lt;p&gt;When dealing with complex orchestration issues, such as those found when you &lt;a href="https://dev.to/troubleshooting/crashloopbackoff-kubernetes"&gt;Fix CrashLoopBackOff in Kubernetes Pods&lt;/a&gt;, use this template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Persona: Kubernetes Expert
Context: Pod is in CrashLoopBackOff.
Task: Analyze the 'Last State' termination message and the current logs.
Constraint: Ignore health check failures; focus on application-level exceptions.
Logs: [Insert Logs Here]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Hardware Requirements and Performance&lt;/h2&gt;

&lt;p&gt;The sweet spot for local log analysis is a machine with 24GB of VRAM (like an RTX 3090 or 4090). This allows you to run the 8B model with a massive context window or even experiment with the 70B model using heavy quantization.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Minimum (Fast)&lt;/th&gt;
&lt;th&gt;Recommended (Pro)&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 3060 (12GB)&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 4090 (24GB)&lt;/td&gt;
&lt;td&gt;VRAM is the primary metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;64GB&lt;/td&gt;
&lt;td&gt;Used for offloading if VRAM fills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;50GB SSD&lt;/td&gt;
&lt;td&gt;200GB NVMe&lt;/td&gt;
&lt;td&gt;Models are large (4GB to 40GB each)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 22.04&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04&lt;/td&gt;
&lt;td&gt;Best driver support for CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you are forced to run on CPU (Apple Silicon M2/M3 is an exception and works great), expect a drop from 50 tokens per second to about 3 to 5 tokens per second. This is acceptable for asynchronous log analysis but frustrating for interactive chatting.&lt;/p&gt;
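
&lt;p&gt;You can measure these numbers on your own hardware before committing to a setup; recent Ollama releases print token timings when the &lt;code&gt;--verbose&lt;/code&gt; flag is set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Prints prompt and eval rates in tokens per second after the answer
echo "Summarize: connection refused to db:5432 after 3 retries." | ollama run llama3:8b --verbose
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;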

&lt;h2&gt;Semantic Anomaly Detection vs. Regex&lt;/h2&gt;

&lt;p&gt;Standard observability tools like Splunk or ELK rely on indices and keyword searches. If you search for "Error", you find errors. But what if the system is failing silently?&lt;/p&gt;

&lt;p&gt;Example: A payment gateway returns &lt;code&gt;200 OK&lt;/code&gt; for every request, but the response body says &lt;code&gt;{"status": "pending", "reason": "timeout"}&lt;/code&gt;. A regex monitor sees the &lt;code&gt;200&lt;/code&gt; and stays green. A local LLM can be prompted to look for logical contradictions:&lt;/p&gt;

&lt;p&gt;"Analyze these logs for 'silent failures'. Look for cases where the HTTP status is 200 but the response body indicates a failure or a timeout."&lt;/p&gt;

&lt;p&gt;This move from syntactic analysis (looking for patterns) to semantic analysis (understanding meaning) is the real power of the local LLM. It allows you to find the unknown unknowns that you didn't know to write a regex for.&lt;/p&gt;
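
&lt;p&gt;Reusing the prompt-file pattern from earlier, a sketch of that semantic sweep looks like this (the log file name is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Build the prompt, then append a raw slice of recent gateway logs
echo "Analyze these logs for silent failures: cases where the HTTP status is 200 but the response body indicates a failure or a timeout." &amp;gt; prompt.txt
tail -n 300 gateway.log &amp;gt;&amp;gt; prompt.txt

cat prompt.txt | ollama run llama3:8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;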

&lt;h2&gt;Log Streamlining and Noise Reduction&lt;/h2&gt;

&lt;p&gt;One of the biggest costs in DevOps is Log Bloat. We store terabytes of &lt;code&gt;INFO&lt;/code&gt; logs that we never read. You can use a local LLM as a pre-processor to summarize logs before they are even archived.&lt;/p&gt;

&lt;p&gt;By running a small, fast model like Mistral v0.3, you can create a Log Summarizer that takes 1,000 lines of verbose debug logs and converts them into three sentences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The application started successfully.&lt;/li&gt;
&lt;li&gt;It attempted to connect to the database three times and failed.&lt;/li&gt;
&lt;li&gt;It entered a sleep state for 30 seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This reduces the cognitive load on the human engineer and can potentially reduce storage costs if you only archive the summaries and a sampled percentage of the raw logs.&lt;/p&gt;
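
&lt;p&gt;A rough sketch of such a summarizer pass, assuming the model has been pulled with &lt;code&gt;ollama pull mistral&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Condense verbose debug output into a three-sentence digest before archiving
echo "Summarize the following logs in exactly three sentences covering startup, errors and final state." &amp;gt; prompt.txt
cat verbose_debug.log &amp;gt;&amp;gt; prompt.txt

cat prompt.txt | ollama run mistral
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;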

&lt;p&gt;Local LLMs are the only viable path for secure, privacy-first debugging in highly regulated environments. While the hardware requirements are higher than using a cloud API, the trade-off is a total elimination of PII leakage risk and the removal of per-token costs. Start by installing Ollama on a GPU-enabled jump box, select a 4-bit quantized Llama 3 model, and begin piping your &lt;code&gt;kubectl&lt;/code&gt; outputs into it to reduce your mean time to resolution (MTTR).&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>localllm</category>
      <category>loganalysis</category>
      <category>devopssecurity</category>
    </item>
    <item>
      <title>How to Build AI Agents for Kubernetes Deployments</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:15:34 +0000</pubDate>
      <link>https://dev.to/devopsstart/how-to-build-ai-agents-for-kubernetes-deployments-34m</link>
      <guid>https://dev.to/devopsstart/how-to-build-ai-agents-for-kubernetes-deployments-34m</guid>
      <description>&lt;p&gt;&lt;em&gt;Ever wanted an AI that doesn't just explain Kubernetes errors but actually helps you fix them? This guide, originally published on devopsstart.com, walks through building autonomous K8s agents using MCP, Kagent, and K8sGPT.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;AI agents for Kubernetes deployments are autonomous systems that follow an "Observe → Reason → Act" loop to resolve cluster issues without manual intervention. While a standard LLM can explain what a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; is, a true agent can detect the error, pull the logs, analyze the stack trace, cross-reference it with recent Git commits, and propose a specific PR to fix the environment variable causing the crash.&lt;/p&gt;

&lt;p&gt;Building these agents requires moving beyond simple prompting and into "tool use" or "function calling." You are essentially giving an LLM a set of specialized skills (API wrappers) that allow it to interact with your cluster, your GitOps pipeline, and your observability stack. In this guide, you will learn how to architect these skills using the Model Context Protocol (MCP) and frameworks like Kagent and K8sGPT to automate the most tedious parts of Kubernetes operations.&lt;/p&gt;

&lt;p&gt;For a deep dive into the foundational concepts of managing the pods these agents will be monitoring, see the guide on &lt;a href="https://dev.to/blog/kubernetes-for-beginners-deploy-your-first-application"&gt;Kubernetes for Beginners: Deploy Your First Application&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Prerequisites&lt;/h2&gt;

&lt;p&gt;Before starting this tutorial, you need a functioning Kubernetes environment and the necessary API access for the LLM. I recommend a development cluster (Kind or Minikube) or a staging namespace in a cloud provider like GKE or EKS to avoid accidental production outages.&lt;/p&gt;

&lt;p&gt;You will need the following tools installed on your local machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl v1.30+&lt;/strong&gt;: The standard Kubernetes CLI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helm v3.14+&lt;/strong&gt;: For managing the agent's dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python 3.11+&lt;/strong&gt;: Most agent frameworks, including Kagent and LangChain, require modern Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An OpenAI API Key (GPT-4o)&lt;/strong&gt; or &lt;strong&gt;Anthropic API Key (Claude 3.5 Sonnet)&lt;/strong&gt;: Agents require high-reasoning models to avoid hallucinations during tool selection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;K8sGPT v0.12+&lt;/strong&gt;: For the diagnostic skill set implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should also have a basic understanding of Kubernetes RBAC. Agents operate as identities within the cluster, and giving them &lt;code&gt;cluster-admin&lt;/code&gt; privileges is a security risk. You will need to be comfortable creating ServiceAccounts and RoleBindings to enforce the principle of least privilege.&lt;/p&gt;

&lt;h2&gt;Overview&lt;/h2&gt;

&lt;p&gt;In this tutorial, we are building a "Deployment Guardian" agent. This isn't a monolithic script, but a modular system capable of three specific skills:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automated Diagnostics&lt;/strong&gt;: Using K8sGPT to scan for misconfigurations and interpreting those errors using an LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Right-Sizing&lt;/strong&gt;: Analyzing pod resource usage and suggesting updates to the Horizontal Pod Autoscaler (HPA).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps Sync Validation&lt;/strong&gt;: Monitoring ArgoCD application health and triggering syncs when drifts are detected.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The core of this architecture relies on the Model Context Protocol (MCP). MCP is an open standard that decouples the LLM from the specific implementation of the tool. Instead of writing a custom wrapper for every single &lt;code&gt;kubectl&lt;/code&gt; command, MCP allows you to expose a standardized "server" that tells the LLM exactly what tools are available, what arguments they take, and what the expected output format is.&lt;/p&gt;

&lt;p&gt;By the end of this guide, you will have an agent that provides the root cause and the exact YAML change needed to fix a deployment, integrated directly into your operational workflow. For those managing the underlying infrastructure of these clusters, understanding how to &lt;a href="https://dev.to/tutorials/deploy-eks-cluster-with-terraform"&gt;Deploy an EKS Cluster with Terraform&lt;/a&gt; provides the necessary context for where these agents actually reside.&lt;/p&gt;

&lt;h2&gt;Step 1: Architecting the Agent Loop&lt;/h2&gt;

&lt;p&gt;Before writing code, you must understand how the agent thinks. A standard LLM request is a linear path: Prompt → Response. An agent loop is circular.&lt;/p&gt;

&lt;p&gt;When you ask an agent to "Fix the failing deployment in the staging namespace," it performs the following sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: The agent calls a tool (for example, &lt;code&gt;get_pod_status&lt;/code&gt;) to see which pods are failing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: It observes three pods in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; and reasons that it needs logs to understand the root cause.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: It calls &lt;code&gt;get_pod_logs&lt;/code&gt; for one of the failing pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: The logs show a &lt;code&gt;java.lang.NullPointerException&lt;/code&gt; related to a missing database URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: It checks the ConfigMap to see if the environment variable is defined.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: It calls &lt;code&gt;get_configmap&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Final Response&lt;/strong&gt;: It concludes the environment variable is missing and suggests the specific &lt;code&gt;kubectl patch&lt;/code&gt; command or Git PR.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To implement this, you can use a framework like Kagent, which is built on AutoGen. It treats the "DevOps Engineer" as one agent and the "Kubernetes Cluster" as a tool-providing environment.&lt;/p&gt;

&lt;h2&gt;Step 2: Implementing the Tooling Layer with MCP&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is the primary mechanism for production-grade agents. Instead of hardcoding functions into your Python script, you run an MCP server that exposes your Kubernetes API.&lt;/p&gt;

&lt;p&gt;First, install the MCP SDK for Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, create a simple MCP server that provides a "skill" to get pod events. This is more efficient than giving the LLM raw &lt;code&gt;kubectl&lt;/code&gt; access because you can filter the output to only include errors, which reduces token usage and hallucination risk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# k8s_mcp_server.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;mcp.server.fastmcp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastMCP&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="n"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastMCP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;K8s-Guardian&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_errors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetches only Warning events for pods in a specific namespace.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--field-selector&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type=Warning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;returncode&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error fetching events: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No warning events found.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run this server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python k8s_mcp_server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM now sees &lt;code&gt;get_pod_errors&lt;/code&gt; as a capability. When it encounters a deployment failure, it will autonomously decide to call this function rather than guessing. This architectural separation allows you to update the Python "skill" without changing the prompt of the LLM.&lt;/p&gt;
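
&lt;p&gt;To exercise the tool before wiring it to an agent, the MCP Python SDK ships an optional CLI with an interactive inspector. The commands below follow the SDK's documented flow as I recall it, so treat them as an assumption to verify against the current docs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Install the SDK with its CLI extra, then launch the inspector against our server
pip install "mcp[cli]"
mcp dev k8s_mcp_server.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;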

&lt;h2&gt;Step 3: Configuring Least-Privilege RBAC&lt;/h2&gt;

&lt;p&gt;Giving an AI agent a &lt;code&gt;kubeconfig&lt;/code&gt; with &lt;code&gt;cluster-admin&lt;/code&gt; is an unacceptable security risk. If the LLM hallucinates a command like &lt;code&gt;kubectl delete ns --all&lt;/code&gt;, the agent will execute it.&lt;/p&gt;

&lt;p&gt;You must create a dedicated &lt;code&gt;ServiceAccount&lt;/code&gt; with a restricted &lt;code&gt;Role&lt;/code&gt;. For our Deployment Guardian, the agent needs to read pods, events, and logs, but it should only be able to "patch" specific resources.&lt;/p&gt;

&lt;p&gt;Create a file named &lt;code&gt;agent-rbac.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s-ai-agent&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai-ops&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-read-write-role&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pods/log"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;events"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configmaps"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;watch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apps"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployments"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replicasets"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;verbs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;list"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RoleBinding&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-read-write-binding&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
&lt;span class="na"&gt;subjects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s-ai-agent&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai-ops&lt;/span&gt;
&lt;span class="na"&gt;roleRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Role&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-read-write-role&lt;/span&gt;
  &lt;span class="na"&gt;apiGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rbac.authorization.k8s.io&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace ai-ops
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; agent-rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To connect your agent to this identity, use a token-based approach or a projected volume if the agent runs inside the cluster. For local development, you can impersonate the ServiceAccount to verify permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; staging &lt;span class="nt"&gt;--as&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;system:serviceaccount:ai-ops:k8s-ai-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
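
&lt;p&gt;Beyond a single impersonated command, it is worth auditing the full permission surface of the agent identity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Enumerate everything the agent can do in the staging namespace
kubectl auth can-i --list -n staging --as=system:serviceaccount:ai-ops:k8s-ai-agent

# Destructive verbs should come back "no"
kubectl auth can-i delete deployments -n staging --as=system:serviceaccount:ai-ops:k8s-ai-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;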



&lt;h2&gt;Step 4: Integrating K8sGPT for Diagnostic Skills&lt;/h2&gt;

&lt;p&gt;While custom MCP tools are great for specific tasks, K8sGPT provides a powerful set of pre-built diagnostic skills. It scans your cluster for common issues and uses an LLM to explain them.&lt;/p&gt;

&lt;p&gt;First, install the K8sGPT CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;k8sgpt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, authenticate it with your LLM provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;k8sgpt auth add &lt;span class="nt"&gt;--backend&lt;/span&gt; openai &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To integrate K8sGPT into your agent's skill set, wrap the &lt;code&gt;k8sgpt analyze&lt;/code&gt; command into a tool. This allows the agent to trigger a full cluster scan and reason over the results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Adding K8sGPT as a tool in our MCP server
&lt;/span&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_cluster_health&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Runs a K8sGPT analysis on the namespace to find errors.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k8sgpt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--namespace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run this, the output provides a detailed analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ k8sgpt analyze --namespace staging
[!] Pod 'auth-service-6f7d' is in CrashLoopBackOff
Analysis: The pod is failing because the 'DB_PASSWORD' environment variable is missing.
The application expects this variable to be provided via a Secret.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent can now combine this high-level analysis with its own &lt;code&gt;get_configmap&lt;/code&gt; tool to find where the secret is missing. This creates a tiered diagnostic approach: K8sGPT finds the "what," and the custom MCP tools find the "how" and "where." If you see these errors frequently, check the &lt;a href="https://dev.to/troubleshooting/how-to-fix-kubernetes-crashloopbackoff-in-production"&gt;Fix Kubernetes CrashLoopBackOff in Production&lt;/a&gt; guide for manual remediation steps.&lt;/p&gt;
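&lt;p&gt;For completeness, here is a minimal sketch of what that &lt;code&gt;get_configmap&lt;/code&gt; tool could look like; the exact signature and output format are assumptions, and &lt;code&gt;subprocess&lt;/code&gt; is assumed to be imported at the top of the server as in the earlier tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical companion tool; assumes "import subprocess" at the top of the server
@mcp.tool()
def get_configmap(configmap_name: str, namespace: str) -&gt; str:
    """Fetches a ConfigMap as YAML so the agent can inspect its keys."""
    cmd = ["kubectl", "get", "configmap", configmap_name, "-n", namespace, "-o", "yaml"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout or result.stderr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;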

&lt;h2&gt;
  
  
  Step 5: Building the Resource Optimization Skill
&lt;/h2&gt;

&lt;p&gt;Resource optimization requires the agent to observe metrics (via Prometheus or Metrics Server) and act on the Horizontal Pod Autoscaler (HPA).&lt;/p&gt;

&lt;p&gt;To implement this, your agent needs a tool that can query the Metrics Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_pod_resource_usage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieves CPU and Memory usage for a specific pod.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pod_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent's reasoning logic for optimization follows this pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trigger&lt;/strong&gt;: The agent is asked to "Optimize the checkout-service."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: It calls &lt;code&gt;get_pod_resource_usage&lt;/code&gt; and sees the pod is consistently using 95% of its memory limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: It calls &lt;code&gt;kubectl get hpa&lt;/code&gt; and sees the HPA is targeting 50% CPU, but the bottleneck is actually memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: The agent realizes the HPA should be updated to include memory metrics or the memory limit should be increased.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: It proposes a YAML change to the HPA definition (see the sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
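
&lt;p&gt;A plausible shape for that proposal, assuming the &lt;code&gt;autoscaling/v2&lt;/code&gt; API and a Deployment named &lt;code&gt;checkout-service&lt;/code&gt;, would add a memory metric alongside the existing CPU target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource            # newly proposed: also scale on memory
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;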

&lt;p&gt;For a detailed explanation of how HPA works to better tune your agent's prompts, read the &lt;a href="https://dev.to/blog/kubernetes-hpa-deep-dive-autoscaling-explained"&gt;Kubernetes HPA Deep Dive&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Automating GitOps with ArgoCD Integration
&lt;/h2&gt;

&lt;p&gt;An agent that runs &lt;code&gt;kubectl patch&lt;/code&gt; directly creates "configuration drift." The source of truth must always be Git. Therefore, your agent's "Act" phase should target your GitOps tool.&lt;/p&gt;

&lt;p&gt;If you are using ArgoCD, give your agent tools to interact with the ArgoCD API or the Git repository. First, ensure you have ArgoCD installed; if not, follow the &lt;a href="https://dev.to/tutorials/how-to-install-argo-cd-gitops-deployment-on-kubernetes"&gt;How to Install Argo CD&lt;/a&gt; guide.&lt;/p&gt;

&lt;p&gt;Now, create a tool that allows the agent to check the sync status of an application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_argocd_app_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Checks if an ArgoCD application is Synced and Healthy.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;argocd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "GitOps Loop" for the agent is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt;: The agent sees a pod failing in the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnose&lt;/strong&gt;: It finds that the image tag &lt;code&gt;v1.2.0&lt;/code&gt; has a bug.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolve&lt;/strong&gt;: It searches for the latest stable image tag in the registry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt;: Instead of running &lt;code&gt;kubectl set image&lt;/code&gt;, it uses a GitHub API tool to create a Pull Request updating the image tag in the Git repository (a sketch of such a tool follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt;: It monitors ArgoCD until the app shows as &lt;code&gt;Synced&lt;/code&gt; and &lt;code&gt;Healthy&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
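
&lt;p&gt;Here is one way such a GitHub API tool could look. This is a sketch: the branch carrying the image-tag change is assumed to be committed and pushed already, and the token variable and base branch are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical tool: the image-tag change is assumed to be pushed to
# 'branch' already; GITHUB_TOKEN is read from the environment.
import os
import requests

@mcp.tool()
def create_image_bump_pr(repo: str, branch: str, title: str) -&gt; str:
    """Opens a GitHub Pull Request from an already-pushed branch into main."""
    url = f"https://api.github.com/repos/{repo}/pulls"
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    payload = {"title": title, "head": branch, "base": "main"}
    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["html_url"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;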

&lt;p&gt;This workflow ensures the AI agent remains a part of the governed pipeline. You can learn more about managing these sync policies in the &lt;a href="https://dev.to/tutorials/how-to-configure-advanced-argo-cd-sync-policies-for-gitops"&gt;Advanced Argo CD Sync Policies&lt;/a&gt; tutorial.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Implementing Safety Rails and Human-in-the-Loop (HITL)
&lt;/h2&gt;

&lt;p&gt;To prevent "hallucination-driven outages," you must implement a safety layer between the agent's reasoning and the action.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Dry-Run Constraint
&lt;/h3&gt;

&lt;p&gt;Every tool that modifies the cluster must implement a &lt;code&gt;--dry-run=server&lt;/code&gt; flag by default. The agent should first call the tool in dry-run mode and present the proposed change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@mcp.tool&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;propose_deployment_patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deployment_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patch_yaml&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Proposes a change to a deployment using dry-run.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patch_yaml&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;cmd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kubectl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deployment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deployment_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--patch-file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patch.yaml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--dry-run=server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Proposed Change: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. The Approval Gate (HITL)
&lt;/h3&gt;

&lt;p&gt;The agent must not execute a &lt;code&gt;patch&lt;/code&gt; or &lt;code&gt;delete&lt;/code&gt; command without manual approval from a human operator, typically via a Slack bot or CLI prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: "I've found that the &lt;code&gt;auth-service&lt;/code&gt; is OOMKilled. I propose increasing the memory limit from 256Mi to 512Mi. Should I apply this change? [Yes/No]"&lt;br&gt;
&lt;strong&gt;Human&lt;/strong&gt;: "Yes"&lt;br&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: (Executes the actual patch command)&lt;/p&gt;
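
&lt;p&gt;A minimal sketch of this gate as a CLI prompt (in production you would route the question through Slack rather than stdin; the function name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal CLI approval gate; assumes "import subprocess" as in the tools above
def apply_with_approval(description: str, cmd: list) -&gt; str:
    """Runs a mutating kubectl command only after an explicit human 'yes'."""
    answer = input(f"{description}\nApply this change? [yes/no]: ")
    if answer.strip().lower() != "yes":
        return "Change rejected by operator."
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout or result.stderr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;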
&lt;h3&gt;
  
  
  3. Policy-as-Code (Kyverno/OPA)
&lt;/h3&gt;

&lt;p&gt;A cluster-level policy engine like Kyverno or OPA Gatekeeper should be the final line of defense. For example, you can enforce a policy that blocks any resource deletion in the &lt;code&gt;production&lt;/code&gt; namespace, regardless of the requester's identity.&lt;/p&gt;
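
&lt;p&gt;A sketch of such a guardrail as a Kyverno &lt;code&gt;ClusterPolicy&lt;/code&gt; (policy and namespace names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-prod-deletes
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: deny-deletes-in-production
      match:
        any:
          - resources:
              kinds: ["*"]
              namespaces: ["production"]
      validate:
        message: "Deleting resources in the 'production' namespace is not allowed."
        deny:
          conditions:
            any:
              - key: "{{ request.operation }}"
                operator: Equals
                value: DELETE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;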
&lt;h2&gt;
  
  
  Step 8: Testing and Validating Agent Performance
&lt;/h2&gt;

&lt;p&gt;Treat your agent's skills like production code.&lt;/p&gt;
&lt;h3&gt;
  
  
  Unit Testing Tools
&lt;/h3&gt;

&lt;p&gt;Test each MCP tool independently. If your &lt;code&gt;get_pod_errors&lt;/code&gt; tool fails to parse &lt;code&gt;kubectl&lt;/code&gt; output, the LLM will receive garbage and hallucinate a solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example test for the tool&lt;/span&gt;
python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"from k8s_mcp_server import get_pod_errors; print(get_pod_errors('staging'))"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
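
&lt;p&gt;A slightly more rigorous version as a pytest case, assuming the decorated tool function remains directly callable and that &lt;code&gt;subprocess&lt;/code&gt; is imported inside &lt;code&gt;k8s_mcp_server&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: unit test with a mocked kubectl call (module and tool names
# assume the server file shown earlier in this guide)
from unittest import mock
import k8s_mcp_server

def test_get_pod_errors_surfaces_crashloop():
    fake = mock.Mock(stdout="auth-service-6f7d  CrashLoopBackOff", stderr="")
    with mock.patch("k8s_mcp_server.subprocess.run", return_value=fake):
        output = k8s_mcp_server.get_pod_errors("staging")
    assert "CrashLoopBackOff" in output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;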



&lt;h3&gt;
  
  
  Scenario-Based Validation (Chaos Engineering)
&lt;/h3&gt;

&lt;p&gt;Test your agent by intentionally breaking things in a sandbox:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inject a Failure&lt;/strong&gt;: Delete a Secret that a deployment needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger Agent&lt;/strong&gt;: Ask, "Why is the deployment failing?"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Did it find the missing secret? (Correctness)&lt;/li&gt;
&lt;li&gt;Did it suggest the right fix? (Accuracy)&lt;/li&gt;
&lt;li&gt;Did it try to delete the namespace? (Safety)&lt;/li&gt;
&lt;li&gt;How many tool calls did it take? (Efficiency)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Token Cost and Latency Tracking
&lt;/h3&gt;

&lt;p&gt;Agents can be expensive. A complex diagnostic loop might call 10 different tools, sending significant context back to the LLM. Use tools like LangSmith or Arize Phoenix to trace the agent's thoughts. If the agent loops infinitely (calling the same tool repeatedly), refine the system prompt to include a "maximum tool call" limit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Agent "Loops" Infinitely
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent calls &lt;code&gt;get_pod_status&lt;/code&gt; repeatedly for 20 turns.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Update the system prompt: "If a tool returns the same result twice, do not call it again. Instead, try a different diagnostic tool or ask the user for more information."&lt;/p&gt;

&lt;h3&gt;
  
  
  RBAC "Forbidden" Errors
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: &lt;code&gt;Error from server (Forbidden): pods "my-pod" is forbidden: User "system:serviceaccount:ai-ops:k8s-ai-agent" cannot get resource "pods/log"&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Check your &lt;code&gt;Role&lt;/code&gt; definition. &lt;code&gt;pods&lt;/code&gt; and &lt;code&gt;pods/log&lt;/code&gt; are different resources in Kubernetes. You must explicitly list &lt;code&gt;pods/log&lt;/code&gt; in the &lt;code&gt;resources&lt;/code&gt; section of your RBAC YAML.&lt;/p&gt;
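
&lt;p&gt;The relevant &lt;code&gt;rules&lt;/code&gt; fragment should look like this (verbs shown are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]   # the subresource must be listed explicitly
    verbs: ["get", "list"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;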

&lt;h3&gt;
  
  
  Hallucinated CLI Flags
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent tries to run &lt;code&gt;kubectl get pods --show-all-errors&lt;/code&gt;, which is not a real flag.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Be explicit in your MCP tool description. Instead of "Fetch pods," say "Fetch pods using the exact command &lt;code&gt;kubectl get pods -n {namespace}&lt;/code&gt;."&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window Overflow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom&lt;/strong&gt;: The agent "forgets" the initial error after calling several tools.&lt;br&gt;
&lt;strong&gt;Fix&lt;/strong&gt;: Implement "summarization" in your tools. Instead of returning raw &lt;code&gt;kubectl&lt;/code&gt; output, filter for the top 5 most relevant errors before sending the text to the LLM.&lt;/p&gt;
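
&lt;p&gt;A sketch of such a filter (the keyword list and line limit are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: trim noisy kubectl output before returning it to the LLM
def summarize_errors(raw_output: str, limit: int = 5) -&gt; str:
    """Keeps only the first few lines that look like errors or warnings."""
    keywords = ("Error", "Warning", "CrashLoopBackOff", "OOMKilled", "Failed")
    hits = [line for line in raw_output.splitlines()
            if any(k in line for k in keywords)]
    return "\n".join(hits[:limit]) or "No obvious errors found."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;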

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building AI agents for Kubernetes is a shift from "writing scripts" to "designing capabilities." By utilizing the Model Context Protocol (MCP), you decouple your agent's reasoning from the underlying API calls, allowing you to iterate on "skills" without breaking the agent's logic.&lt;/p&gt;

&lt;p&gt;We have moved from the basic "Observe → Reason → Act" loop to a production-ready architecture featuring least-privilege RBAC, GitOps integration via ArgoCD, and strict human-in-the-loop safety rails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable Next Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start Small&lt;/strong&gt;: Implement one "read-only" skill (like the &lt;code&gt;get_pod_errors&lt;/code&gt; tool) and run it in a local Kind cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure the Perimeter&lt;/strong&gt;: Apply the RBAC constraints before moving the agent to a shared development environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement the Gate&lt;/strong&gt;: Add a manual approval step for any tool that uses &lt;code&gt;kubectl patch&lt;/code&gt; or &lt;code&gt;kubectl delete&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor and Refine&lt;/strong&gt;: Use a tracing tool to see where your agent is hallucinating and refine your tool descriptions accordingly.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>kubernetesaiagents</category>
      <category>modelcontextprotocol</category>
      <category>k8sgpt</category>
      <category>gitopsautomation</category>
    </item>
    <item>
      <title>How to Manage Multiple Azure Subscriptions in Terraform</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 22:01:41 +0000</pubDate>
      <link>https://dev.to/devopsstart/how-to-manage-multiple-azure-subscriptions-in-terraform-1bnf</link>
      <guid>https://dev.to/devopsstart/how-to-manage-multiple-azure-subscriptions-in-terraform-1bnf</guid>
      <description>&lt;p&gt;&lt;em&gt;Managing Hub-and-Spoke architectures in Azure can be a challenge when dealing with multiple subscriptions. This guide, originally published on devopsstart.com, explains how to use Terraform provider aliases to streamline your deployments.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Manage Multiple Azure Subscriptions in Terraform
&lt;/h2&gt;

&lt;p&gt;To deploy resources across multiple Azure subscriptions in a single Terraform configuration, you must use provider aliases. By default, the &lt;code&gt;azurerm&lt;/code&gt; provider targets only one subscription based on your authentication context. To override this, you define multiple provider blocks, assigning an &lt;code&gt;alias&lt;/code&gt; to each and specifying a unique &lt;code&gt;subscription_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This pattern is essential for Hub-and-Spoke network architectures. In these environments, central shared services (like Azure Firewall or ExpressRoute) live in a Hub subscription, while application workloads reside in separate Spoke subscriptions. Without aliases, you would be forced to run separate Terraform states and pipelines for every single subscription, which makes cross-subscription networking a manual nightmare.&lt;/p&gt;

&lt;p&gt;You can find the complete provider specification in the &lt;a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs" rel="noopener noreferrer"&gt;official Terraform Azure Provider documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Provider Aliases
&lt;/h2&gt;

&lt;p&gt;To start, you need to configure your &lt;code&gt;providers.tf&lt;/code&gt; file. The provider without an alias becomes the default. Any provider with an &lt;code&gt;alias&lt;/code&gt; must be explicitly called when defining a resource using the &lt;code&gt;provider&lt;/code&gt; meta-argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# providers.tf&lt;/span&gt;

&lt;span class="c1"&gt;# Default provider (Spoke Subscription)&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;subscription_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"00000000-0000-0000-0000-000000000000"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Aliased provider (Hub Subscription)&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;alias&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub"&lt;/span&gt;
  &lt;span class="nx"&gt;features&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
  &lt;span class="nx"&gt;subscription_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"11111111-1111-1111-1111-111111111111"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you create a resource, use the &lt;code&gt;provider&lt;/code&gt; argument to tell Terraform which subscription to use. If you omit this, Terraform defaults to the primary provider.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Deploy a VNet in the Hub subscription&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network"&lt;/span&gt; &lt;span class="s2"&gt;"hub_vnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-vnet"&lt;/span&gt;
  &lt;span class="nx"&gt;address_space&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eastus"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-rg"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Deploy a VNet in the Spoke subscription (default provider)&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network"&lt;/span&gt; &lt;span class="s2"&gt;"spoke_vnet"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-vnet"&lt;/span&gt;
  &lt;span class="nx"&gt;address_space&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.1.0.0/16"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eastus"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-rg"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cross-Subscription Data Referencing
&lt;/h2&gt;

&lt;p&gt;A common production scenario involves fetching an existing resource ID from a Hub subscription to use as a property in a Spoke resource, such as creating a VNet peering. In my experience, this is where most "Resource Not Found" errors occur because the data block defaults to the wrong subscription.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fetch Hub VNet ID from the Hub subscription&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network"&lt;/span&gt; &lt;span class="s2"&gt;"hub_vnet_data"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-vnet"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hub-rg"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create peering in the Spoke subscription pointing to the Hub&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_virtual_network_peering"&lt;/span&gt; &lt;span class="s2"&gt;"spoke_to_hub"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-to-hub"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spoke-rg"&lt;/span&gt;
  &lt;span class="nx"&gt;virtual_network_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_virtual_network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spoke_vnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;remote_virtual_network_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;azurerm_virtual_network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub_vnet_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By explicitly assigning &lt;code&gt;provider = azurerm.hub&lt;/code&gt; to the data block, Terraform authenticates against the Hub subscription to retrieve the ID before attempting to create the peering in the Spoke subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Module Provider Gotcha
&lt;/h2&gt;

&lt;p&gt;The biggest mistake engineers make with multi-subscription setups is assuming modules inherit aliases automatically. They do not. If you call a module and it contains &lt;code&gt;azurerm&lt;/code&gt; resources, those resources will use the default provider regardless of where the module is called from.&lt;/p&gt;

&lt;p&gt;To fix this, you must explicitly pass the aliased provider into the module using the &lt;code&gt;providers&lt;/code&gt; map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"spoke_workload"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"./modules/workload"&lt;/span&gt;

  &lt;span class="c1"&gt;# Map the module's internal 'azurerm' provider to the 'hub' alias&lt;/span&gt;
  &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;azurerm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;vnet_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;azurerm_virtual_network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hub_vnet_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the module code, do not define a &lt;code&gt;provider&lt;/code&gt; block. Just use the standard &lt;code&gt;azurerm&lt;/code&gt; resource blocks; the mapping happens at the root level. This keeps your modules reusable across different environments. I have seen this fail in environments with dozens of subscriptions, where a missed provider mapping caused a production workload to be deployed into a development subscription and triggered significant security audit failures. To maintain high reliability, consider &lt;a href="https://dev.to/blog/terraform-testing-best-practices-beyond-plan-and-pray"&gt;testing your infrastructure as code&lt;/a&gt;.&lt;/p&gt;
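
&lt;p&gt;For clarity, the module's own &lt;code&gt;versions.tf&lt;/code&gt; should declare only the requirement; the version constraint here is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;# modules/workload/versions.tf
# Declare the requirement only; no provider block, so the root module
# decides which subscription this module deploys into.
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "&gt;= 3.0"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;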

&lt;h2&gt;
  
  
  Best Practices for Naming and Scale
&lt;/h2&gt;

&lt;p&gt;Avoid generic names like &lt;code&gt;azurerm.sub1&lt;/code&gt; or &lt;code&gt;azurerm.secondary&lt;/code&gt;. In a production environment with dozens of subscriptions, these names provide zero context and lead to configuration errors. Use functional names that describe the role of the subscription:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;azurerm.hub&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azurerm.shared_services&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azurerm.prod_workload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;azurerm.identity_mgmt&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In environments with more than 50 subscriptions, managing these aliases in a single &lt;code&gt;providers.tf&lt;/code&gt; file becomes brittle. At that scale, I recommend splitting your state files by subscription or using a wrapper tool. This reduces the blast radius of a single &lt;code&gt;terraform apply&lt;/code&gt; and decreases the time spent in the "refreshing state" phase, which can otherwise take several minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can I use the same Service Principal for multiple subscriptions?&lt;/strong&gt;&lt;br&gt;
Yes, as long as that Service Principal has the required RBAC roles (for example, Contributor) across all targeted subscriptions. Terraform handles the switching via the &lt;code&gt;subscription_id&lt;/code&gt; field in the provider block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do I need to run &lt;code&gt;az account set&lt;/code&gt; before running Terraform?&lt;/strong&gt;&lt;br&gt;
No. When you explicitly define &lt;code&gt;subscription_id&lt;/code&gt; in the provider block, Terraform ignores the current active subscription in your Azure CLI session and targets the ID specified in the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does using aliases increase the plan time?&lt;/strong&gt;&lt;br&gt;
Slightly. Terraform must establish separate API sessions for each provider instance. In very large environments, this can add 10 to 30 seconds to the refresh phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using provider aliases is the only professional way to handle multi-subscription Azure deployments. By separating your Hub and Spoke configurations and explicitly passing providers to your modules, you eliminate the risk of deploying resources to the wrong environment.&lt;/p&gt;

&lt;p&gt;Your next steps should be to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audit your current &lt;code&gt;providers.tf&lt;/code&gt; and rename any generic aliases to functional names.&lt;/li&gt;
&lt;li&gt;Check your module calls to ensure &lt;code&gt;providers = { ... }&lt;/code&gt; is being used for all non-default subscriptions.&lt;/li&gt;
&lt;li&gt;Implement data blocks to automate the linkage between Hub and Spoke resources.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>terraformazure</category>
      <category>azurermprovider</category>
      <category>infrastructureascode</category>
      <category>azuresubscriptionmanagement</category>
    </item>
    <item>
      <title>GitHub Actions Security: How to Stop Secret Leaks in CI/CD</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:46:31 +0000</pubDate>
      <link>https://dev.to/devopsstart/github-actions-security-how-to-stop-secret-leaks-in-cicd-2nh5</link>
      <guid>https://dev.to/devopsstart/github-actions-security-how-to-stop-secret-leaks-in-cicd-2nh5</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on devopsstart.com, this guide explores how to eliminate static secrets and harden your GitHub Actions pipelines against credential theft.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The fastest way to compromise a production environment isn't by hacking a firewall; it's by stealing a long-lived AWS Access Key leaked in a GitHub Actions log. Secret leakage in CI/CD pipelines is a systemic risk because these pipelines possess the "keys to the kingdom", allowing them to provision infrastructure, modify databases and push code to production.&lt;/p&gt;

&lt;p&gt;Secret leaks typically happen through one of three vectors: accidental logging, compromised third-party actions or malicious pull requests from external contributors. To stop them, you must move from static secrets to identity-based authentication using OpenID Connect (OIDC) and implement a strict least-privilege model for your workflow permissions.&lt;/p&gt;

&lt;p&gt;In this guide, you will learn how to implement OIDC, why mutable version tags are dangerous, and how to defend against "pwn-request" attacks. For those managing complex infrastructure, combining these security practices with &lt;a href="https://dev.to/tutorials/how-to-automate-terraform-reviews-with-github-actions"&gt;how to automate terraform reviews with github actions&lt;/a&gt; ensures that security is baked into the code review process, not just the execution phase.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Anatomy of a Secret Leak: Why Your Logs Aren't Safe
&lt;/h2&gt;

&lt;p&gt;GitHub provides a built-in masking feature that replaces known secrets with asterisks (&lt;code&gt;***&lt;/code&gt;) in the logs. However, this is a convenience feature, not a security boundary. Attackers can easily bypass masking by encoding the secret. If a developer runs &lt;code&gt;echo $SECRET | base64&lt;/code&gt;, the resulting string is no longer the original secret and will not be masked. Any user with read access to the action run can decode it instantly.&lt;/p&gt;

&lt;p&gt;Another common leak vector is the "debug dump". When a pipeline fails, developers often add &lt;code&gt;run: env&lt;/code&gt; or &lt;code&gt;run: printenv&lt;/code&gt; to debug the environment. This prints every single environment variable to the logs. While GitHub tries to mask the secrets, any variable that was dynamically generated or slightly modified during the build process will leak in plain text.&lt;/p&gt;

&lt;p&gt;The most dangerous leak comes from the supply chain. If you use a third-party action like &lt;code&gt;uses: some-random-user/setup-tool@v1&lt;/code&gt;, you are executing arbitrary code from that user's repository. If that account is compromised, the attacker can update the code in &lt;code&gt;@v1&lt;/code&gt; to &lt;code&gt;curl&lt;/code&gt; your environment variables to an external server. Because the action runs with the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; and any secrets you passed to it, the attacker gains full access without leaving a trace in your logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving from Static Secrets to OIDC
&lt;/h2&gt;

&lt;p&gt;The industry standard for securing cloud access in CI/CD is OpenID Connect (OIDC). Long-lived IAM keys (the &lt;code&gt;AWS_ACCESS_KEY_ID&lt;/code&gt; and &lt;code&gt;AWS_SECRET_ACCESS_KEY&lt;/code&gt; pair) are liabilities because they never expire and are often stored as static GitHub Secrets. If these leak, they remain valid until you manually rotate them. OIDC replaces these static keys with short-lived, identity-based tokens.&lt;/p&gt;

&lt;p&gt;With OIDC, GitHub Actions acts as an Identity Provider (IdP). When a workflow runs, it requests a JWT (JSON Web Token) from GitHub. The workflow then presents this token to the cloud provider (AWS, Azure or GCP). The cloud provider verifies the token's signature and checks if the "claims" (such as the repository name or the branch) match a pre-defined trust relationship. If they match, the provider issues a temporary security token, typically valid for one hour.&lt;/p&gt;

&lt;p&gt;To implement this in AWS, you first create an IAM Role with a Trust Policy that trusts the GitHub OIDC provider. Then, use the official &lt;code&gt;aws-actions/configure-aws-credentials&lt;/code&gt; action (v4). You must grant &lt;code&gt;id-token: write&lt;/code&gt; under the &lt;code&gt;permissions&lt;/code&gt; block in your YAML to allow the runner to request the JWT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: OIDC Authentication for AWS&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secure Deploy&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;main&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt; &lt;span class="c1"&gt;# Required for requesting the JWT&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;  &lt;span class="c1"&gt;# Required for checkout&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/github-oidc-role&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify Identity&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws sts get-caller-identity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the last command shows the assumed role, not a static user. If this workflow is compromised, the attacker only has a temporary token that expires quickly, which reduces the blast radius significantly compared to static keys.&lt;/p&gt;
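
&lt;p&gt;For reference, the Trust Policy on that IAM Role typically looks like the following sketch; the account ID, organization and repository are placeholders you must replace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
        }
      }
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;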

&lt;h2&gt;
  
  
  Hardening the Supply Chain: The Danger of Mutable Tags
&lt;/h2&gt;

&lt;p&gt;Most DevOps engineers use version tags when referencing actions, such as &lt;code&gt;uses: actions/checkout@v4&lt;/code&gt;. This looks clean, but it is a security anti-pattern. Tags in Git are mutable; a maintainer (or an attacker who has hijacked the account) can move the &lt;code&gt;v4&lt;/code&gt; tag to a different, malicious commit. You think you are using a trusted version, but the underlying code has changed without your knowledge.&lt;/p&gt;

&lt;p&gt;To eliminate this risk, pin actions to a full-length commit SHA. A SHA is an immutable fingerprint of the code. If the code changes by a single character, the SHA changes. While this makes updating actions more tedious, it is the only way to guarantee that the code you audited is the code running today.&lt;/p&gt;

&lt;p&gt;I have seen this fail at scale, where a single compromised community action allowed an attacker to exfiltrate internal environment variables across dozens of repos. In a production environment with over 100 repositories, manually updating SHAs is a burden. Use a tool like Renovate Bot or Dependabot to automate these updates while keeping them pinned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# UNSAFE: Using a mutable tag&lt;/span&gt;
&lt;span class="c1"&gt;# If the maintainer changes what @v4 points to, your pipeline is compromised.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

&lt;span class="c1"&gt;# SAFE: Using a full-length commit SHA&lt;/span&gt;
&lt;span class="c1"&gt;# This code will NEVER change, regardless of what happens to the repository tags.&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11&lt;/span&gt; &lt;span class="c1"&gt;# v4.1.1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When pinning, always include a comment noting which version the SHA corresponds to. In environments with strict security compliance, such as clusters running on GKE Autopilot or hardened EKS nodes, this level of granularity is mandatory to pass SOC2 or ISO27001 audits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defending Against "Pwn-Requests" and Fork Attacks
&lt;/h2&gt;

&lt;p&gt;One of the most overlooked vulnerabilities in GitHub Actions is the handling of Pull Requests from forks. By default, the &lt;code&gt;pull_request&lt;/code&gt; event does not grant secrets to the runner for security reasons. However, developers often find this frustrating when they need to run integration tests that require a database key. To solve this, they use the &lt;code&gt;pull_request_target&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pull_request_target&lt;/code&gt; event is extremely dangerous. Unlike &lt;code&gt;pull_request&lt;/code&gt;, it runs in the context of the base branch (usually &lt;code&gt;main&lt;/code&gt;) and has access to secrets. If you have a workflow triggered by &lt;code&gt;pull_request_target&lt;/code&gt; that checks out the code from the PR branch and then runs a script, a malicious contributor can modify that script in their fork to &lt;code&gt;echo $SECRET | base64&lt;/code&gt;. Since the workflow runs with the base branch's permissions, the attacker steals your production credentials.&lt;/p&gt;

&lt;p&gt;To safely handle external contributions, never execute untrusted code from a fork while secrets are present. If you need to run tests on a PR, use the standard &lt;code&gt;pull_request&lt;/code&gt; event and utilize "Environment" protections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DANGEROUS: Vulnerable to pwn-requests&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request_target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt; &lt;span class="c1"&gt;# This checks out the PR code from the fork&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm install &amp;amp;&amp;amp; npm test&lt;/span&gt; &lt;span class="c1"&gt;# The PR author can change 'npm test' to steal secrets&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.API_KEY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The correct pattern is to require a manual approval from a maintainer before a workflow can access a protected environment's secrets. This creates a human-in-the-loop firewall that prevents automated credential theft.&lt;/p&gt;
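
&lt;p&gt;A sketch of that pattern, assuming a GitHub Environment named &lt;code&gt;production&lt;/code&gt; with Required Reviewers enabled in the repository settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# SAFER: secrets scoped to the 'production' environment are only
# released after a configured reviewer approves this run.
on:
  pull_request:
jobs:
  integration-test:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4   # pin to a commit SHA in production, as discussed above
      - run: npm install &amp;&amp; npm test
        env:
          API_KEY: ${{ secrets.API_KEY }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;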

&lt;h2&gt;
  
  
  Best Practices for CI/CD Hardening
&lt;/h2&gt;

&lt;p&gt;To maintain a secure posture, implement these five practices across every repository in your organization.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Implement a Global Permissions Policy&lt;/strong&gt;: Start every job with the most restrictive permissions. Use &lt;code&gt;permissions: contents: read&lt;/code&gt; by default and only add &lt;code&gt;id-token: write&lt;/code&gt; or &lt;code&gt;packages: write&lt;/code&gt; when specifically required. This prevents a compromised action from deleting your repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Environment-Based Secrets&lt;/strong&gt;: Do not put production secrets in the global "Repository Secrets" section. Create a "Production" environment and assign secrets there. This allows you to enforce "Required Reviewers", meaning no code can access production keys without a senior engineer's sign-off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate Secret Scanning&lt;/strong&gt;: Integrate Gitleaks or TruffleHog into your pipeline as a pre-commit hook or an initial CI step (see the sketch after this list). These tools look for patterns (like &lt;code&gt;AKIA...&lt;/code&gt; for AWS) and fail the build if a secret is detected in the commit history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Secret Passing via Env&lt;/strong&gt;: Instead of passing secrets as environment variables to every step, pass them only to the specific step that needs them. This minimizes the number of processes that have the secret in their memory space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate Credentials Every 90 Days&lt;/strong&gt;: Even with OIDC, some legacy systems require static keys. Implement a strict rotation policy. If a key is not rotated regularly, a leak might go undetected for months, giving attackers a permanent backdoor.&lt;/li&gt;
&lt;/ol&gt;
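
&lt;p&gt;As a sketch of practice 3, Gitleaks can run as an early CI step; the community &lt;code&gt;gitleaks/gitleaks-action&lt;/code&gt; is assumed here, and per the guidance above you should pin it to a full commit SHA:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- name: Scan for leaked secrets
  uses: gitleaks/gitleaks-action@v2   # pin to a full commit SHA in production
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;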

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does GitHub really mask all my secrets in the logs?
&lt;/h3&gt;

&lt;p&gt;No. GitHub only masks the exact string stored in the secret. If your code transforms the secret (e.g., base64 encoding, URL encoding or splitting the string), the resulting output will not be masked. Never rely on masking as a primary security control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is &lt;code&gt;pull_request_target&lt;/code&gt; worse than &lt;code&gt;pull_request&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pull_request&lt;/code&gt; runs in the context of the merge commit and has no access to secrets from the base repository. &lt;code&gt;pull_request_target&lt;/code&gt; runs in the context of the base branch and has full access to secrets, meaning any code introduced by a contributor in a fork can access those secrets if the workflow executes that code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use OIDC for every single cloud provider?
&lt;/h3&gt;

&lt;p&gt;Yes. Every major provider (AWS, Azure, GCP and HashiCorp Vault) now supports OIDC for GitHub Actions. Moving away from static JSON keys or CSV credential files reduces your operational overhead and eliminates the risk of "stale" credentials living in your repository settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I still use version tags like &lt;code&gt;@v4&lt;/code&gt; if I use a private runner?
&lt;/h3&gt;

&lt;p&gt;Yes, but it is still a bad practice. Even on a private runner, a compromised third-party action can exfiltrate data from your internal network or steal the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; to modify your source code. The location of the runner does not protect you from supply chain attacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Securing GitHub Actions requires moving away from the "trust by default" mindset. The combination of OIDC for identity, SHA pinning for supply chain integrity and strict &lt;code&gt;permissions&lt;/code&gt; blocks creates a defense-in-depth strategy. The most critical immediate step you can take is auditing your workflows for &lt;code&gt;pull_request_target&lt;/code&gt; and replacing static cloud keys with OIDC roles.&lt;/p&gt;

&lt;p&gt;Start by implementing these three actionable steps today: first, replace all &lt;code&gt;v*&lt;/code&gt; tags with commit SHAs in your most critical deployment pipeline. Second, migrate your production cloud authentication to OIDC to eliminate long-lived keys. Third, configure GitHub Environments with mandatory reviewers for all production secrets. By shifting security left into your CI/CD configuration, you ensure that your pipeline is a tool for delivery, not a liability.&lt;/p&gt;

</description>
      <category>githubactionssecurity</category>
      <category>oidcauthentication</category>
      <category>cicdhardening</category>
      <category>supplychainsecurity</category>
    </item>
    <item>
      <title>Cursor vs Copilot vs Cody: Best AI Editor for DevOps</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:41:26 +0000</pubDate>
      <link>https://dev.to/devopsstart/cursor-vs-copilot-vs-cody-best-ai-editor-for-devops-5a42</link>
      <guid>https://dev.to/devopsstart/cursor-vs-copilot-vs-cody-best-ai-editor-for-devops-5a42</guid>
      <description>&lt;p&gt;&lt;em&gt;Choosing the right AI editor for DevOps is about more than just autocomplete—it's about codebase context. Originally published on devopsstart.com, this guide compares Cursor, Copilot, and Cody for IaC and Kubernetes workflows.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Choosing an AI code assistant for DevOps isn't about who can write the cleanest Python function; it's about who understands the relationship between your &lt;code&gt;variables.tf&lt;/code&gt;, your Helm charts and your GitHub Actions workflow. Most AI tools are built for application developers, which means they often fail when faced with the fragmented nature of infrastructure. If you've ever had Copilot suggest a deprecated Terraform provider or a Kubernetes API version that hasn't existed since 1.16, you know the "context problem" firsthand.&lt;/p&gt;

&lt;p&gt;In this guide, you'll learn how to navigate the trade-offs between GitHub Copilot, Cursor and Sourcegraph Cody specifically through the lens of a Platform or DevOps engineer. We will dive into how each tool handles codebase indexing, how they manage the hallucinations common in YAML and HCL, and which one actually helps you reduce "time to first green build" in a complex CI/CD pipeline. By the end, you'll have a clear decision matrix to determine which tool fits your specific organizational scale, security requirements and infrastructure complexity. Whether you are managing a handful of scripts or a massive polyglot monorepo, the right choice depends on how the AI "sees" your architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Problem: Why General AI Fails DevOps
&lt;/h2&gt;

&lt;p&gt;DevOps engineers don't write linear code; they build distributed systems. A single change in a Terraform module might require updates to a Kubernetes manifest and a corresponding change in a CI pipeline. Standard AI completions fail here because they typically rely on "active tab" context. If you are editing &lt;code&gt;deployment.yaml&lt;/code&gt; but the relevant environment variable is defined in &lt;code&gt;terraform/outputs.tf&lt;/code&gt; (which is closed), the AI is guessing based on generic internet patterns, not your actual architecture.&lt;/p&gt;

&lt;p&gt;For example, imagine you are trying to reference a secret created by an ExternalSecrets operator. A generic AI will suggest a standard Kubernetes Secret syntax. A context-aware AI knows you are using &lt;code&gt;ExternalSecret&lt;/code&gt; objects and will suggest the correct API group. This is the difference between a tool that saves you five seconds of typing and a tool that prevents a production outage. To solve this, tools have moved toward Retrieval-Augmented Generation (RAG), which indexes your local or remote files to provide actual project awareness. You can read more about the complexities of managing these environments in the &lt;a href="https://dev.to/blog/kubernetes-v136-features-deprecations-upgrade-guide"&gt;Kubernetes v1.36 Features, Deprecations &amp;amp; Upgrade Guide&lt;/a&gt; to see why version-specific context is so critical.&lt;/p&gt;

&lt;p&gt;Consider this scenario: you need to add a new resource to a Terraform module that already has a strict naming convention and specific tagging requirements defined in a separate &lt;code&gt;locals.tf&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# locals.tf&lt;/span&gt;
&lt;span class="nx"&gt;locals&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;common_tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Phoenix"&lt;/span&gt;
    &lt;span class="nx"&gt;ManagedBy&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Terraform"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# main.tf&lt;/span&gt;
&lt;span class="c1"&gt;# You start typing: resource "aws_s3_bucket" "logs" {&lt;/span&gt;
&lt;span class="c1"&gt;# A context-blind AI suggests: tags = { Name = "logs" }&lt;/span&gt;
&lt;span class="c1"&gt;# A context-aware AI suggests: tags = local.common_tags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the AI knows your &lt;code&gt;locals.tf&lt;/code&gt; exists, it stops hallucinating generic tags and starts following your internal standards. This eliminates the manual "copy-paste" cycle that often leads to inconsistent infrastructure and failed compliance checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor: The AI-Native Powerhouse for IaC
&lt;/h2&gt;

&lt;p&gt;Cursor is not a plugin; it is a fork of VS Code. This architectural choice is a game changer for DevOps engineers because it allows the AI to integrate deeply with the IDE's indexing engine. While Copilot feels like a sophisticated autocomplete, Cursor feels like a pair programmer that has actually read your entire repository. It uses a local index of your files, meaning when you ask it to "Add a new environment to the staging cluster," it scans your existing &lt;code&gt;.tfvars&lt;/code&gt; and &lt;code&gt;kustomize&lt;/code&gt; overlays to mirror the pattern exactly.&lt;/p&gt;

&lt;p&gt;For those managing complex Terraform projects, Cursor's &lt;code&gt;@Codebase&lt;/code&gt; feature is indispensable. You can prompt the AI to analyze the relationship between different modules without opening every file. This is particularly useful when you are implementing &lt;a href="https://dev.to/blog/terraform-testing-best-practices-beyond-plan-and-pray"&gt;Terraform Testing Best Practices&lt;/a&gt; and need the AI to generate test cases based on the actual resource dependencies. In large repositories, where naming conventions are strict and dependencies run deep, this level of indexing prevents the "hallucinated resource" errors that plague plugin-based assistants.&lt;/p&gt;

&lt;p&gt;Here is how you would actually use Cursor to refactor a Kubernetes manifest to use a new ConfigMap source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In Cursor, you use the Cmd+K (or Ctrl+K) interface.&lt;/span&gt;
&lt;span class="c"&gt;# Prompt: "@Codebase update all deployments in /k8s/overlays/prod to use the &lt;/span&gt;
&lt;span class="c"&gt;# new configmap-v2 defined in configmap.yaml"&lt;/span&gt;

&lt;span class="c"&gt;# Cursor identifies all files in the directory and applies the change:&lt;/span&gt;
&lt;span class="c"&gt;# Before:&lt;/span&gt;
&lt;span class="c"&gt;# configMapRef:&lt;/span&gt;
&lt;span class="c"&gt;#   name: app-config&lt;/span&gt;
&lt;span class="c"&gt;# After:&lt;/span&gt;
&lt;span class="c"&gt;# configMapRef:&lt;/span&gt;
&lt;span class="c"&gt;#   name: app-config-v2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The magic here is that Cursor doesn't just find and replace text; it understands that the &lt;code&gt;configMapRef&lt;/code&gt; is a Kubernetes object property. It maintains the indentation of your YAML (which is the bane of every DevOps engineer's existence) and ensures that the change is consistent across all target files. This removes the tedious manual verification usually required after a bulk edit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sourcegraph Cody: Mastering the Enterprise Monorepo
&lt;/h2&gt;

&lt;p&gt;While Cursor excels at local indexing, Sourcegraph Cody is designed for the enterprise scale. Many Platform teams work in massive polyglot monorepos where the Terraform code is in one directory, the Go-based operator is in another and the documentation is in a separate Wiki or GitHub Pages site. Cody's strength lies in its ability to pull context from remote repositories and external documentation via the Sourcegraph index.&lt;/p&gt;

&lt;p&gt;Cody is the "Enterprise Context King" because it doesn't just look at your open files; it looks at your entire organization's knowledge graph. If your company has a proprietary way of handling VPC peering or a specific wrapper around Pulumi, Cody can be configured to prioritize those internal patterns over generic public documentation. This is vital for SOC2 or HIPAA compliant environments where "following the internal standard" is not a suggestion, but a legal requirement.&lt;/p&gt;

&lt;p&gt;Imagine you are tasked with updating a CI pipeline using a custom internal GitHub Action that isn't documented on the public web.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Internal Deploy&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-corp/deploy-helper@v2&lt;/span&gt; &lt;span class="c1"&gt;# Cody knows this action exists in your org&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cluster_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.CLUSTER_ID }}&lt;/span&gt;
          &lt;span class="c1"&gt;# Cody suggests the 'environment' input because it indexed &lt;/span&gt;
          &lt;span class="c1"&gt;# the 'deploy-helper' repo in the same organization.&lt;/span&gt;
          &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production'&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By indexing the &lt;code&gt;my-corp/deploy-helper&lt;/code&gt; repository, Cody provides suggestions for inputs and outputs that GitHub Copilot would simply guess. This reduces the need to constantly switch between your editor and the internal documentation browser. For teams implementing &lt;a href="https://dev.to/blog/gitops-testing-strategies-validate-deployments-with-argocd"&gt;GitOps Testing Strategies&lt;/a&gt;, Cody can help bridge the gap between the ArgoCD configuration and the underlying Kubernetes manifests by tracing the logic across different repositories.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing AI Performance on YAML and HCL
&lt;/h2&gt;

&lt;p&gt;When it comes to Infrastructure as Code (IaC), the biggest risk is the "confidently wrong" suggestion. HCL (HashiCorp Configuration Language) and YAML are whitespace-sensitive and schema-dependent. GitHub Copilot is generally the fastest for simple snippets, but it is the most prone to hallucinating API versions. For example, it might suggest &lt;code&gt;apiVersion: extensions/v1beta1&lt;/code&gt; for an Ingress resource, which has been deprecated for years.&lt;/p&gt;

&lt;p&gt;Cursor and Cody perform better here because they can be anchored to specific versions of your codebase. If your project specifies Terraform v1.7.0 in a &lt;code&gt;.terraform-version&lt;/code&gt; file, Cursor is more likely to suggest syntax compatible with that version. In a head-to-head comparison for generating a complex Kubernetes NetworkPolicy, Cursor typically wins on formatting, while Cody wins on referencing your existing network architecture.&lt;/p&gt;

&lt;p&gt;Let's look at a practical comparison of how these tools handle a request to create a Kubernetes Service of type &lt;code&gt;LoadBalancer&lt;/code&gt; with specific cloud annotations for AWS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prompt: "Create a LoadBalancer service for the 'api' deployment with AWS NLB annotations"&lt;/span&gt;

&lt;span class="c1"&gt;# Copilot: Often gives a generic LoadBalancer without the specific &lt;/span&gt;
&lt;span class="c1"&gt;# service.beta.kubernetes.io/aws-load-balancer-type: nlb annotation.&lt;/span&gt;

&lt;span class="c1"&gt;# Cursor: Checks your other services, sees you use 'nlb-ip' mode, and suggests:&lt;/span&gt;
&lt;span class="c1"&gt;# annotations:&lt;/span&gt;
&lt;span class="c1"&gt;#   service.beta.kubernetes.io/aws-load-balancer-type: "nlb-ip"&lt;/span&gt;
&lt;span class="c1"&gt;#   service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"&lt;/span&gt;

&lt;span class="c1"&gt;# Cody: References the official AWS Load Balancer Controller docs (if indexed)&lt;/span&gt;
&lt;span class="c1"&gt;# and suggests the most current annotation for your specific K8s version.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "hallucination risk" in Kubernetes is particularly high because the API evolves so rapidly. A tool that relies on a training set from 2022 will lead you toward deprecated fields. A tool that uses RAG to look at your current &lt;code&gt;kubectl version&lt;/code&gt; or your manifest files will guide you toward the current standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for AI-Driven DevOps
&lt;/h2&gt;

&lt;p&gt;To get the most out of these tools without introducing security vulnerabilities or infrastructure drift, you must treat AI output as a "proposed change" rather than "final code." Follow these guidelines to maintain stability.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Version Pinning in Prompts&lt;/strong&gt;: Never just ask for a "Terraform script." Specify the version. Use prompts like "Using Terraform v1.7.x and the AWS provider v5.0, create a VPC..." This forces the AI to narrow its search space and reduces the likelihood of deprecated syntax.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify with Static Analysis&lt;/strong&gt;: AI is great at writing code but terrible at verifying it. Always pipe AI-generated HCL through &lt;code&gt;terraform validate&lt;/code&gt; and YAML through &lt;code&gt;kube-linter&lt;/code&gt; or &lt;code&gt;datree&lt;/code&gt;. This catches the small indentation errors that AI frequently introduces; a sketch of such a CI gate follows this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Seed Your Prompts&lt;/strong&gt;: In Cursor or Cody, explicitly tag the files that define your architecture. Instead of "Fix this error," use "@variables.tf @main.tf fix the mismatch in the subnet ID." This provides the RAG engine with a direct path to the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sanitize Secrets Before Indexing&lt;/strong&gt;: Ensure your &lt;code&gt;.gitignore&lt;/code&gt; is robust. While most modern AI editors respect &lt;code&gt;.gitignore&lt;/code&gt;, double-check that you aren't indexing &lt;code&gt;.tfstate&lt;/code&gt; files or local &lt;code&gt;.terraform/&lt;/code&gt; directories, which can contain sensitive values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative Refinement&lt;/strong&gt;: Start with a high-level architecture prompt, then drill down into specific resources. Asking an AI to "Write my entire EKS cluster" usually results in a mess. Ask it to "Define the VPC," then "Define the EKS cluster using that VPC," and finally "Add the node groups."&lt;/li&gt;
&lt;/ol&gt;
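
&lt;p&gt;As a hedged sketch of the second guideline, a CI gate might look like this (it assumes &lt;code&gt;kube-linter&lt;/code&gt; is already available on the runner and that manifests live under a &lt;code&gt;k8s/&lt;/code&gt; directory):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# validate-iac.yml -- reject AI-generated config that doesn't parse or lint
on: pull_request

jobs:
  validate-iac:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Terraform syntax
        run: |
          terraform init -backend=false   # no remote state needed just to validate
          terraform validate
      - name: Lint Kubernetes manifests
        run: kube-linter lint k8s/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
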

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Which AI editor is the most secure for corporate code?
&lt;/h3&gt;

&lt;p&gt;Sourcegraph Cody generally leads in enterprise security because it offers robust controls over where data is stored and how it is indexed. For organizations with strict data residency requirements, Cody's ability to run on-premises or in a private cloud is a major advantage. Cursor and Copilot have "Privacy Modes" that promise not to train on your data, but for SOC2/HIPAA environments, the transparency of Cody's indexing layer is typically more acceptable to security auditors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can these tools actually replace writing Terraform by hand?
&lt;/h3&gt;

&lt;p&gt;No, and attempting to do so is dangerous. AI is excellent at boilerplate (creating 10 similar S3 buckets) and translation (converting a Helm chart to a Kustomize overlay), but it cannot reason about your business logic or the cost implications of a specific instance type. Use AI to handle the "syntax toil" while you handle the "architectural intent."&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I stop the AI from suggesting deprecated Kubernetes APIs?
&lt;/h3&gt;

&lt;p&gt;The best way is to provide a "source of truth" file in your repository. Create a &lt;code&gt;K8S_STANDARDS.md&lt;/code&gt; file that lists your cluster version and preferred API versions. In Cursor or Cody, refer to this file using &lt;code&gt;@K8S_STANDARDS.md&lt;/code&gt; in your prompt. This overrides the AI's general training data with your specific project requirements.&lt;/p&gt;
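
&lt;p&gt;A minimal version of such a file might look like this (the versions and rules shown are purely illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# K8S_STANDARDS.md
Cluster version: 1.29

Preferred API versions:
- Ingress: networking.k8s.io/v1
- CronJob: batch/v1 (never batch/v1beta1)
- HorizontalPodAutoscaler: autoscaling/v2

House rules:
- Every container sets resource requests and limits.
- Labels follow the app.kubernetes.io/* conventions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
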

&lt;h3&gt;
  
  
  Does using a fork like Cursor break my VS Code extensions?
&lt;/h3&gt;

&lt;p&gt;Since Cursor is a fork of VS Code, it is compatible with almost all VS Code extensions. You can import your existing themes, keybindings and plugins (like the HashiCorp Terraform extension) directly. The primary difference is the built-in AI layer, which replaces the need for a separate Copilot plugin.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The transition from "AI as a plugin" to "AI as an environment" is the most significant shift in DevOps productivity since the rise of GitOps. GitHub Copilot remains a solid choice for generalists who want a low-friction experience. However, for the specialized needs of a Platform Engineer, Cursor's local codebase indexing provides a level of precision in HCL and YAML that plugins cannot match. For those operating at a massive corporate scale, Sourcegraph Cody's remote context capabilities make it the only viable choice for navigating polyglot monorepos.&lt;/p&gt;

&lt;p&gt;Your next step should be a two-week trial: install Cursor for your local feature development to see if the &lt;code&gt;@Codebase&lt;/code&gt; indexing reduces your context-switching. Simultaneously, if you are in a large team, evaluate Cody's ability to index your internal documentation. Once you've chosen your tool, integrate a static analysis step into your CI pipeline to ensure that AI-generated speed doesn't come at the cost of production stability. Stop fighting with YAML indentation and start leveraging the context of your entire architecture.&lt;/p&gt;

</description>
      <category>aicodeeditors</category>
      <category>infrastructureascode</category>
      <category>kubernetesautomation</category>
      <category>devopstools</category>
    </item>
    <item>
      <title>Build an Internal Developer Platform with Backstage and Crossplane</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:36:22 +0000</pubDate>
      <link>https://dev.to/devopsstart/build-an-internal-developer-platform-with-backstage-and-5gjp</link>
      <guid>https://dev.to/devopsstart/build-an-internal-developer-platform-with-backstage-and-5gjp</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop the 'ticket-ops' madness! This guide, originally published on devopsstart.com, shows you how to combine Backstage and Crossplane to build a true self-service Internal Developer Platform.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Stop forcing your developers to learn the intricacies of cloud provider consoles or struggle with 500-line Terraform modules just to get a database. The gap between raw infrastructure and developer productivity is where "ticket ops" thrives, slowing down deployment cycles and frustrating engineers. To solve this, you need an Internal Developer Platform (IDP) that abstracts infrastructure complexity into a self-service experience.&lt;/p&gt;

&lt;p&gt;An IDP allows developers to provision resources via a simplified interface without needing to be cloud experts. In this guide, you will learn how to build a production-ready IDP by combining Backstage and Crossplane. Backstage acts as your front-end portal, providing a unified interface for service discovery and software templates. Crossplane serves as the back-end control plane, turning Kubernetes into a universal API for managing cloud resources.&lt;/p&gt;

&lt;p&gt;By the end of this article, you will understand the architecture required to move from manual Infrastructure as Code (IaC) workflows to a scalable self-service platform model. You'll see exactly how to map a button click in a UI to a live AWS RDS instance via GitOps, reducing the cognitive load on your developers while maintaining strict governance for your platform team. For more on managing the underlying clusters, you can check out &lt;a href="https://dev.to/blog/kubernetes-for-beginners-deploy-your-first-application"&gt;Kubernetes for Beginners: Deploy Your First Application&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: Connecting Backstage to Crossplane
&lt;/h2&gt;

&lt;p&gt;Building an IDP isn't about one tool; it's about the pipeline. The most common mistake is trying to connect Backstage directly to a cloud API. That is a security nightmare and lacks auditability. Instead, use a GitOps-driven control plane architecture. In this flow, Backstage doesn't "create" the infrastructure; it "requests" it by committing a manifest to Git.&lt;/p&gt;

&lt;p&gt;The sequence works as follows: a developer selects a "Provision Postgres" template in the Backstage Scaffolder. Backstage then triggers a commit of a simple YAML file to a Git repository. An automated GitOps controller, such as ArgoCD, detects this change and syncs the manifest to a Kubernetes cluster. Inside that cluster, Crossplane v1.14.x sees the new Custom Resource (CR) and communicates with the cloud provider's API to provision the actual resource.&lt;/p&gt;

&lt;p&gt;This ensures that your Git history is the single source of truth, which is critical for compliance and disaster recovery. To ensure these deployments are handled reliably, you should learn &lt;a href="https://dev.to/tutorials/how-to-set-up-argo-cd-gitops-for-kubernetes-automation"&gt;How to Set Up Argo CD GitOps for Kubernetes Automation&lt;/a&gt;.&lt;/p&gt;
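
&lt;p&gt;To make the middle hop of that sequence concrete, here is a minimal sketch of an ArgoCD &lt;code&gt;Application&lt;/code&gt; watching the claims repository (the repository URL, path and namespaces are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infra-claims
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/infra-claims.git   # the repo Backstage commits to
    targetRevision: main
    path: claims/
  destination:
    server: https://kubernetes.default.svc
    namespace: crossplane-claims
  syncPolicy:
    automated:
      selfHeal: true   # re-applies a drifted claim; enable pruning with care, since
                       # deleting a claim can delete the underlying infrastructure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
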

&lt;p&gt;The "connective tissue" here is the YAML schema. Backstage must output a manifest that exactly matches the &lt;code&gt;CompositeResourceDefinition&lt;/code&gt; (XRD) you've defined in Crossplane. If the Scaffolder outputs &lt;code&gt;db_size: small&lt;/code&gt; but Crossplane expects &lt;code&gt;storageClass: small&lt;/code&gt;, the request will hang in a "Pending" state. You must treat your XRDs as the API contract between your platform team and your developers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Abstracting Cloud Complexity with Crossplane Compositions
&lt;/h2&gt;

&lt;p&gt;If you give developers raw Crossplane resources, you've just traded Terraform for Kubernetes YAML, which does not reduce cognitive load. The real power of Crossplane lies in Compositions. A Composition allows you to bundle multiple low-level resources (like a VPC, a Subnet, and an RDS instance) into a single, high-level "Composite Resource" (XR) that developers can actually understand.&lt;/p&gt;

&lt;p&gt;For example, instead of requiring a developer to specify &lt;code&gt;db.aws.upbound.io/v1beta1&lt;/code&gt; with 20 mandatory fields, you create a &lt;code&gt;CompositeDatabase&lt;/code&gt; definition. The developer only provides a name and a size. Your platform team defines the "blueprint" that maps &lt;code&gt;size: small&lt;/code&gt; to a &lt;code&gt;t3.micro&lt;/code&gt; instance with 20GB of encrypted GP3 storage.&lt;/p&gt;

&lt;p&gt;Here is an example of a simplified &lt;code&gt;CompositeResourceDefinition&lt;/code&gt; (XRD) that defines the API your developers will use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apiextensions.crossplane.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CompositeResourceDefinition&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xpostgresdatabases.platform.example.org&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform.example.org&lt;/span&gt;
  &lt;span class="na"&gt;names&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XPostgresDatabase&lt;/span&gt;
    &lt;span class="na"&gt;plural&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xpostgresdatabases&lt;/span&gt;
  &lt;span class="na"&gt;versions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1alpha1&lt;/span&gt;
    &lt;span class="na"&gt;served&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;referenceable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;openAPIV3Schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;object&lt;/span&gt;
        &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;object&lt;/span&gt;
            &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;storageGb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer&lt;/span&gt;
              &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is how the developer's request (the "Claim") looks. This is the exact YAML that Backstage will generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform.example.org/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgresDatabase&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-service-db&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-service-prod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storageGb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By using this approach, you eliminate the need for developers to know AWS-specific jargon. You can change the underlying instance type or backup policy in the Composition without ever touching the developer's manifest.&lt;/p&gt;
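
&lt;p&gt;For completeness, here is a hedged sketch of the Composition side of that blueprint (the field names follow the Upbound AWS provider and may differ between provider versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xpostgresdatabases.aws.platform.example.org
spec:
  compositeTypeRef:
    apiVersion: platform.example.org/v1alpha1
    kind: XPostgresDatabase
  resources:
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            engine: postgres
            instanceClass: db.t3.micro   # the platform team's default, invisible to developers
            storageEncrypted: true       # governance baked into the blueprint
      patches:
        - fromFieldPath: spec.storageGb
          toFieldPath: spec.forProvider.allocatedStorage
        - fromFieldPath: spec.region
          toFieldPath: spec.forProvider.region
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
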

&lt;h2&gt;
  
  
  Implementing the Backstage Scaffolder for Self-Service
&lt;/h2&gt;

&lt;p&gt;The Backstage Scaffolder is the engine that turns a user's form input into a Git commit. To make this work with Crossplane, you create a &lt;code&gt;template.yaml&lt;/code&gt; file. This template defines the UI form (the questions you ask the developer) and the "steps" required to process the answer.&lt;/p&gt;

&lt;p&gt;In a production setup, your template should not just create a file; it should validate the input. For example, if a developer requests 10,000GB of storage, your template or a validating admission webhook in Kubernetes should catch it. The template uses "Nunjucks" templating to inject the form values into the Crossplane Claim YAML.&lt;/p&gt;

&lt;p&gt;Below is a snippet of a Backstage software template designed to provision a Crossplane database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/template/scaffolder-entity/v1.0.0&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;provision-rds-postgres&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provision RDS Postgres&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Creates a production-ready Postgres DB via Crossplane&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Database Details&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;dbName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Database Name&lt;/span&gt;
        &lt;span class="na"&gt;storageGb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;integer&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Storage Size (GB)&lt;/span&gt;
          &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
        &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Environment&lt;/span&gt;
          &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;staging&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch-base&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch:template&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;templateRepo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;templates/infrastructure/rds&lt;/span&gt;
        &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.dbName }}&lt;/span&gt;
          &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.storageGb }}&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.environment }}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;publish:github&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;allowedStatuses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;success&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;repoUrl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.com?owner=my-org&amp;amp;repo=${{ parameters.dbName }}-infra&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the developer clicks "Create," Backstage creates a new repository (or updates an existing one) with the resulting YAML. The critical part is the &lt;code&gt;fetch:template&lt;/code&gt; step. It takes the generic &lt;code&gt;claim.yaml&lt;/code&gt; from your template repository and fills it with the user's specific requirements. This removes the possibility of syntax errors in the YAML, as the developer never actually writes the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GitOps Feedback Loop and Production Gotchas
&lt;/h2&gt;

&lt;p&gt;A major pain point in IDPs is the "black hole" effect: a developer clicks a button in Backstage, the commit happens, and then nothing. They have no idea if the database is actually ready or if the Crossplane provider is stuck in a back-off loop. To solve this, you must implement a feedback loop.&lt;/p&gt;

&lt;p&gt;One effective method is using the Backstage Kubernetes plugin combined with the Crossplane status fields. Crossplane updates the &lt;code&gt;status&lt;/code&gt; section of the Claim resource once the cloud provider confirms the resource is &lt;code&gt;Ready: True&lt;/code&gt;. You can configure Backstage to surface these Kubernetes resource statuses directly on the service's catalog page. If a resource is failing, the developer sees a "Warning" status in the portal, which links them to the logs.&lt;/p&gt;

&lt;p&gt;At scale, you'll notice that Crossplane's reconciliation loop can put significant pressure on the Kubernetes API server. I've seen cases where frequent status updates across 500+ cloud resources caused noticeable API latency. To mitigate this, tune the poll interval in your Crossplane providers. Don't check every 60 seconds whether a database is ready; 5 or 10 minutes is usually sufficient for infrastructure that takes 15 minutes to provision.&lt;/p&gt;
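
&lt;p&gt;One way to lengthen the interval is a &lt;code&gt;ControllerConfig&lt;/code&gt; (deprecated in favor of &lt;code&gt;DeploymentRuntimeConfig&lt;/code&gt;, but still widely used) that passes a longer &lt;code&gt;--poll&lt;/code&gt; value to the provider. The flag and package version below are illustrative and vary by provider, so treat this as a sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: slow-poll
spec:
  args:
    - --poll=10m   # reconcile existing resources every 10 minutes
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-rds
spec:
  package: xpkg.upbound.io/upbound/provider-aws-rds:v1.1.0   # illustrative version
  controllerConfigRef:
    name: slow-poll
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
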

&lt;p&gt;Another production gotcha is "orphaned resources." If a developer deletes the manifest from Git, ArgoCD deletes the Claim from Kubernetes, and Crossplane deletes the RDS instance. This is great for dev environments but catastrophic for production. You must implement a "deletion policy" in your Compositions. Set &lt;code&gt;deletionPolicy: Orphan&lt;/code&gt; for production workloads. This ensures that if the YAML is accidentally deleted, the actual cloud resource remains intact.&lt;/p&gt;
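
&lt;p&gt;Inside the Composition's &lt;code&gt;base&lt;/code&gt;, that is a single extra field, sketched here against the same Upbound RDS resource as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;base:
  apiVersion: rds.aws.upbound.io/v1beta1
  kind: Instance
  spec:
    deletionPolicy: Orphan   # deleting the claim leaves the RDS instance running in AWS
    forProvider:
      engine: postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
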

&lt;h2&gt;
  
  
  Best Practices for Platform Engineering
&lt;/h2&gt;

&lt;p&gt;Implementing an IDP is more of an organizational challenge than a technical one. If you build a perfect platform that no one uses, you've failed. Follow these principles to ensure adoption:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the "Golden Path":&lt;/strong&gt; Do not try to automate every possible cloud resource on day one. Identify the three most requested resources (for example, S3 buckets, Postgres DBs, and Redis caches) and build high-quality templates for those. This provides immediate value and builds trust.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enforce Governance via Compositions:&lt;/strong&gt; Use Crossplane Compositions to bake in security. Ensure every S3 bucket is encrypted and every RDS instance is in a private subnet by default. The developer shouldn't even see the "Encryption" checkbox; it should be mandatory and invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treat your IDP as a Product:&lt;/strong&gt; Your developers are your customers. Conduct user interviews to find where the friction is. If they find the Backstage form too long, simplify it. If they need more visibility into costs, integrate a cost-tracking plugin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Strong RBAC:&lt;/strong&gt; Use Kubernetes namespaces to isolate claims. Ensure that a developer in the &lt;code&gt;team-a&lt;/code&gt; namespace cannot modify a &lt;code&gt;PostgresDatabase&lt;/code&gt; claim in the &lt;code&gt;team-b&lt;/code&gt; namespace. Use a tool like Kyverno to enforce these boundaries, as sketched after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version your Compositions:&lt;/strong&gt; When you update a Composition (for example, upgrading the RDS instance class), don't just push it to production. Version your XRDs and Compositions so you can migrate services gradually rather than forcing a global update.&lt;/li&gt;
&lt;/ol&gt;
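
&lt;p&gt;Namespace isolation itself comes from ordinary Kubernetes RBAC; Kyverno adds guardrails on top. As a hedged sketch, a policy like the following could require every claim in &lt;code&gt;team-a&lt;/code&gt; to carry that team's label (the policy name and labels are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: claims-carry-team-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - PostgresDatabase   # the claim kind from the XRD above
              namespaces:
                - team-a
      validate:
        message: "Claims in team-a must be labelled team: team-a."
        pattern:
          metadata:
            labels:
              team: team-a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
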

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How does this approach differ from using Terraform with a CI/CD pipeline?&lt;/strong&gt;&lt;br&gt;
Traditional Terraform requires a "push" model where a pipeline runs &lt;code&gt;terraform apply&lt;/code&gt;. This often leads to state locking issues and configuration drift. The Backstage + Crossplane approach uses a "pull" model (Control Plane). Crossplane constantly monitors the state of the cloud and automatically corrects drift without needing a manual pipeline trigger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this mean I have to migrate all my existing Terraform code to Crossplane?&lt;/strong&gt;&lt;br&gt;
No. You can run them side-by-side. Use Crossplane for new, self-service workloads while keeping your core networking and foundation (VPCs, IAM roles) in Terraform. You can even use the Terraform provider for Crossplane to manage existing Terraform modules through the Kubernetes API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens if the cloud provider API is down during provisioning?&lt;/strong&gt;&lt;br&gt;
Crossplane employs an exponential back-off strategy. If the AWS API returns a 500 error, Crossplane will keep retrying the request. The Kubernetes resource will stay in a &lt;code&gt;Synced: False&lt;/code&gt; state. Because you have a GitOps audit trail, you can easily see which resources are stuck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is Backstage overkill for small teams?&lt;/strong&gt;&lt;br&gt;
If you have fewer than five developers, a simple README and a set of shared Terraform modules might suffice. However, once you hit a scale where the platform team becomes a bottleneck for "simple" requests, the investment in Backstage pays off by eliminating the ticket queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Combining Backstage and Crossplane allows you to move from a culture of "ticket-based infrastructure" to true self-service. By using Backstage as the user interface and Crossplane as the control plane, you create a system where developers can provision production-ready resources in minutes, not days. This doesn't just speed up delivery; it allows your platform engineers to stop performing repetitive manual tasks and start focusing on high-value architectural improvements.&lt;/p&gt;

&lt;p&gt;To get started, your first actionable step is to install Crossplane v1.14.x on a development cluster and create your first &lt;code&gt;CompositeResourceDefinition&lt;/code&gt; for a simple resource, like an S3 bucket. Once the API is working, set up a basic Backstage instance and create a software template that outputs the YAML required by that XRD. Start small, validate the "Golden Path" with one team, and then scale the platform to the rest of your organization.&lt;/p&gt;

</description>
      <category>platformengineering</category>
      <category>backstageio</category>
      <category>crossplane</category>
      <category>internaldeveloperplatform</category>
    </item>
    <item>
      <title>Essential kubectl Commands Cheat Sheet</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:36:24 +0000</pubDate>
      <link>https://dev.to/devopsstart/essential-kubectl-commands-cheat-sheet-2elo</link>
      <guid>https://dev.to/devopsstart/essential-kubectl-commands-cheat-sheet-2elo</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop memorizing every flag! I've put together a handy kubectl cheat sheet for managing pods, deployments, and debugging. Originally published on devopsstart.com.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pod Management
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all pods in current namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get pods -A&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List pods across all namespaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl describe pod &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show detailed pod information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl delete pod &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Delete a specific pod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View pod logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt; -f&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stream pod logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl exec -it &amp;lt;pod&amp;gt; -- sh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Open shell in pod&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Deployments
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get deployments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl scale deploy &amp;lt;name&amp;gt; --replicas=3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Scale a deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout status deploy/&amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check rollout status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl rollout undo deploy/&amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rollback a deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl set image deploy/&amp;lt;name&amp;gt; &amp;lt;container&amp;gt;=&amp;lt;image&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Update container image&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Services and Networking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get svc&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get ingress&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all ingress resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl port-forward svc/&amp;lt;name&amp;gt; 8080:80&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Forward local port to service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get endpoints&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List service endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Debugging
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get events --sort-by=.metadata.creationTimestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View cluster events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl top pods&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show pod resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl top nodes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show node resource usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl logs &amp;lt;pod&amp;gt; --previous&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;View logs from crashed container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl describe node &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check node conditions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Context and Config
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config get-contexts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List all contexts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config use-context &amp;lt;name&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Switch context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config current-context&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Show current context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl get ns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;List namespaces&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kubectl config set-context --current --namespace=&amp;lt;ns&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Set default namespace&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>kubernetes</category>
      <category>kubectl</category>
      <category>cheatsheet</category>
    </item>
    <item>
      <title>Debug Kubernetes CrashLoopBackOff in 30 Seconds</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:31:20 +0000</pubDate>
      <link>https://dev.to/devopsstart/debug-kubernetes-crashloopbackoff-in-30-seconds-1c7c</link>
      <guid>https://dev.to/devopsstart/debug-kubernetes-crashloopbackoff-in-30-seconds-1c7c</guid>
      <description>&lt;p&gt;&lt;em&gt;Struggling with a pod stuck in CrashLoopBackOff? This quick guide, originally published on devopsstart.com, shows you the exact commands to diagnose the root cause in seconds.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Your pod is stuck in &lt;code&gt;CrashLoopBackOff&lt;/code&gt; and you need to find out why — fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;--previous&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--previous&lt;/code&gt; flag shows logs from the last crashed container instance. This is the single most useful flag for debugging crash loops.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combine with describe for the full picture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 &lt;span class="s2"&gt;"Last State"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows the exit code and reason for the last termination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Exit Codes
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Application error&lt;/td&gt;
&lt;td&gt;Check app logs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;td&gt;OOMKilled&lt;/td&gt;
&lt;td&gt;Increase memory limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;139&lt;/td&gt;
&lt;td&gt;Segfault&lt;/td&gt;
&lt;td&gt;Check binary compatibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;143&lt;/td&gt;
&lt;td&gt;SIGTERM&lt;/td&gt;
&lt;td&gt;Graceful shutdown issue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
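
&lt;p&gt;For the most common case (exit code 137), the usual fix is raising the container's memory limit. A minimal sketch, with an illustrative container name and values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Pod spec fragment: give the container more headroom before the kernel OOM-kills it
containers:
  - name: app
    image: my-registry/app:v1   # illustrative
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"   # raise this if --previous logs stop abruptly with exit code 137
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
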

&lt;h2&gt;
  
  
  Why It Works
&lt;/h2&gt;

&lt;p&gt;Kubernetes keeps logs from the previous container instance even after it crashes. Without &lt;code&gt;--previous&lt;/code&gt;, you'd only see logs from the current (possibly empty) instance that hasn't had time to produce output before crashing again.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>debugging</category>
      <category>pods</category>
    </item>
    <item>
      <title>Rapid Rollback: `kubectl set image` for Urgent Fixes</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:26:17 +0000</pubDate>
      <link>https://dev.to/devopsstart/rapid-rollback-kubectl-set-image-for-urgent-fixes-52l5</link>
      <guid>https://dev.to/devopsstart/rapid-rollback-kubectl-set-image-for-urgent-fixes-52l5</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published on devopsstart.com. When production breaks, every second counts—here is how to use &lt;code&gt;kubectl set image&lt;/code&gt; for a precise and rapid rollback.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You've just deployed a new container image to production, and almost immediately, monitoring alerts start screaming. Latency is spiking, error rates are through the roof, and your customers are experiencing service degradation. In these high-pressure moments, a fast, reliable rollback mechanism is critical. While Kubernetes offers robust rollout and rollback capabilities via &lt;code&gt;kubectl rollout undo&lt;/code&gt;, there are specific scenarios where &lt;code&gt;kubectl set image&lt;/code&gt; can provide a quicker, more direct path to recovery, especially when you know &lt;em&gt;exactly&lt;/em&gt; which image version you need to revert to.&lt;/p&gt;

&lt;p&gt;This tip focuses on leveraging &lt;code&gt;kubectl set image&lt;/code&gt; for urgent rollbacks. You'll learn when this command is most effective, how to accurately identify the correct previous image tag, and how to execute the command to quickly stabilize your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding &lt;code&gt;kubectl set image&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;kubectl set image&lt;/code&gt; command is primarily designed to atomically update the image of one or more specific containers within a Kubernetes resource. It typically targets Deployments, StatefulSets, DaemonSets, or ReplicationControllers. When executed, it modifies the resource's Pod template to point to the new image tag, which then triggers a new rolling update.&lt;/p&gt;

&lt;p&gt;While &lt;code&gt;kubectl set image&lt;/code&gt; is frequently used for &lt;em&gt;forward&lt;/em&gt; deployments (e.g., updating &lt;code&gt;v1.1.9&lt;/code&gt; to &lt;code&gt;v1.2.0&lt;/code&gt;), its direct nature makes it exceptionally well-suited for rapid rollbacks. When you specify a previous, stable image, Kubernetes initiates a new rollout toward that desired state. This behavior differentiates it from &lt;code&gt;kubectl rollout undo&lt;/code&gt;, which inherently steps back through the deployment's recorded history, revision by revision.&lt;/p&gt;

&lt;p&gt;Here’s a common example of how &lt;code&gt;kubectl set image&lt;/code&gt; is used to update an image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app my-container&lt;span class="o"&gt;=&lt;/span&gt;my-registry/my-app:v1.2.0 &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deployment.apps/my-app image updated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command updates &lt;code&gt;my-container&lt;/code&gt; within the &lt;code&gt;my-app&lt;/code&gt; deployment in the &lt;code&gt;production&lt;/code&gt; namespace to use the &lt;code&gt;v1.2.0&lt;/code&gt; image from &lt;code&gt;my-registry&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Urgent Rollback Scenario
&lt;/h2&gt;

&lt;p&gt;Consider this scenario: your &lt;code&gt;my-app:v1.2.0&lt;/code&gt; release introduced a critical bug that bypassed your staging environment checks. You pushed it to production an hour ago, and now, critical alerts are firing, indicating significant application failures. You need to revert to the last known good image, let's say &lt;code&gt;my-app:v1.1.9&lt;/code&gt;, &lt;em&gt;immediately&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Why might &lt;code&gt;kubectl set image&lt;/code&gt; be preferred over &lt;code&gt;kubectl rollout undo&lt;/code&gt; in such a situation?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Directness and Precision:&lt;/strong&gt; If you know the exact, stable image tag to which you need to revert, &lt;code&gt;kubectl set image&lt;/code&gt; offers an explicit and precise command. This avoids ambiguity and ensures you land on the intended stable state directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bypassing Unhealthy Revisions:&lt;/strong&gt; If multiple faulty deployments occurred after your last stable one (e.g., you tried &lt;code&gt;v1.2.0&lt;/code&gt;, then &lt;code&gt;v1.2.1-hotfix&lt;/code&gt;, and both failed), a plain &lt;code&gt;kubectl rollout undo&lt;/code&gt; only steps back to the most recent (still broken) revision, and its &lt;code&gt;--to-revision&lt;/code&gt; flag requires you to first look up the correct revision number. &lt;code&gt;kubectl set image&lt;/code&gt; lets you jump straight to the known good &lt;code&gt;v1.1.9&lt;/code&gt; using the image tag you already know.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forced Redeploy (Edge Cases):&lt;/strong&gt; In rare cases, even if an image tag is theoretically the same, you might want to force Kubernetes to re-pull container images and redeploy pods due to local caching issues or other inconsistencies. Re-setting the image explicitly with &lt;code&gt;kubectl set image&lt;/code&gt; can achieve this, ensuring fresh pods are created. For more on debugging common Kubernetes issues, refer to our article on &lt;a href="https://dev.to/troubleshooting/crashloopbackoff-kubernetes"&gt;Troubleshooting CrashLoopBackOff in Kubernetes&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Identifying the Previous Image Tag
&lt;/h2&gt;

&lt;p&gt;The critical first step for a &lt;code&gt;kubectl set image&lt;/code&gt; rollback is accurately identifying the last known good image tag. You can achieve this by inspecting your deployment's revision history:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Check Rollout History:&lt;/strong&gt;&lt;br&gt;
This command provides a concise summary of your deployment's revision history, showing the changes made at each step.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout &lt;span class="nb"&gt;history &lt;/span&gt;deployment/my-app &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
shell

    **Expected output:**


```bash
    deployment.apps/my-app 
    REVISION  CHANGE-CAUSE
    1         &amp;lt;none&amp;gt;
    2         my-container: my-registry/my-app:v1.1.8
    3         my-container: my-registry/my-app:v1.1.9
    4         my-container: my-registry/my-app:v1.2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;From this output, if `v1.2.0` (revision 4) is currently causing issues, then `v1.1.9` (revision 3) is your immediate target for rollback. Note that `CHANGE-CAUSE` may also contain details if `--record` was used during deployment.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Describe a Specific Revision (Optional Verification):&lt;/strong&gt;&lt;br&gt;
To be absolutely certain about the container images used in a particular revision, you can describe it in detail. This is a good verification step before initiating a rollback.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout &lt;span class="nb"&gt;history &lt;/span&gt;deployment/my-app &lt;span class="nt"&gt;--revision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected (truncated) output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deployment.apps/my-app with revision 3
Pod Template:
  Labels:       app=my-app
                pod-template-hash=54c9c76...
  Containers:
    my-container:
      Image:        my-registry/my-app:v1.1.9
      Port:         8080/TCP
      Environment:  &amp;lt;none&amp;gt;
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This confirms that &lt;code&gt;my-registry/my-app:v1.1.9&lt;/code&gt; was indeed the image used for revision 3, making it a reliable rollback target.&lt;/p&gt;
&lt;h2&gt;
  
  
  Executing the &lt;code&gt;kubectl set image&lt;/code&gt; Rollback
&lt;/h2&gt;

&lt;p&gt;Once you have identified the precise desired image tag (e.g., &lt;code&gt;my-registry/my-app:v1.1.9&lt;/code&gt; in our example), executing the rollback is straightforward and immediate:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app my-container&lt;span class="o"&gt;=&lt;/span&gt;my-registry/my-app:v1.1.9 &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deployment.apps/my-app image updated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Upon execution, Kubernetes will immediately initiate a new rolling update. It will begin replacing the currently failing &lt;code&gt;v1.2.0&lt;/code&gt; pods with new ones running the specified stable &lt;code&gt;v1.1.9&lt;/code&gt; image.&lt;/p&gt;

&lt;p&gt;You can monitor the progress of this new rollout using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout status deployment/my-app &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output during rollout:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Waiting &lt;span class="k"&gt;for &lt;/span&gt;deployment &lt;span class="s2"&gt;"my-app"&lt;/span&gt; rollout to finish: 1 old replicas are pending termination...
Waiting &lt;span class="k"&gt;for &lt;/span&gt;deployment &lt;span class="s2"&gt;"my-app"&lt;/span&gt; rollout to finish: 1 old replicas are pending termination...
deployment &lt;span class="s2"&gt;"my-app"&lt;/span&gt; successfully rolled out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the rollout is complete, your application should be consistently running the stable &lt;code&gt;v1.1.9&lt;/code&gt; image, and your monitoring alerts should ideally begin to subside as service is restored.&lt;/p&gt;
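&lt;p&gt;As a final check, you can verify that every pod is now running the expected image. A small sketch (add a label selector such as &lt;code&gt;-l app=my-app&lt;/code&gt; if the namespace contains other pods; that label is an assumption about your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List each pod alongside the image of its first container
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;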

&lt;h2&gt;
  
  
  Important Considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rollback Strategy Impact:&lt;/strong&gt; This &lt;code&gt;kubectl set image&lt;/code&gt; method performs a rolling update. It's crucial that your application is designed to handle a brief period where both the old (problematic) and new (stable) versions of pods are running concurrently. This typically means ensuring backward and forward compatibility for APIs and data schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Immutability:&lt;/strong&gt; Always strive to use immutable image tags (e.g., &lt;code&gt;v1.1.9&lt;/code&gt;, &lt;code&gt;v1.2.0&lt;/code&gt;, &lt;code&gt;sha256:abcdef...&lt;/code&gt;) rather than mutable tags like &lt;code&gt;latest&lt;/code&gt;. Immutable tags guarantee that a specific tag always refers to the exact same image content, which is fundamental for reliable and reproducible rollbacks (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditing and History:&lt;/strong&gt; Using &lt;code&gt;kubectl set image&lt;/code&gt; creates a new revision in the deployment's history. This automatically ensures that your rollback action is recorded, providing a clear audit trail of changes made to your deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stateful Workloads:&lt;/strong&gt; For StatefulSets, exercise particular caution when changing image versions. If a new image version introduced changes that affect persistent storage or state, a simple image rollback won't undo database schema migrations or data-format changes on its own. Always understand the data implications before rolling back.&lt;/li&gt;
&lt;/ul&gt;
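&lt;p&gt;As noted above, pinning by digest is the strictest form of immutability. A hypothetical sketch; the digest placeholder would come from your registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Roll back by digest instead of tag; &amp;lt;digest&amp;gt; is a placeholder value
kubectl set image deployment/my-app my-container=my-registry/my-app@sha256:&amp;lt;digest&amp;gt; -n production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;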

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;When a problematic image release throws production into disarray, reaction time is paramount. While &lt;code&gt;kubectl rollout undo&lt;/code&gt; is a valuable tool, &lt;code&gt;kubectl set image&lt;/code&gt; provides a direct, efficient, and precise alternative for reverting to a specific, known-good image. This capability can significantly reduce Mean Time To Recovery (MTTR) by allowing you to bypass potentially multiple failing revisions and jump straight to stability. By understanding your deployment history and precisely targeting the last stable image, you can restore service with a single, auditable command.&lt;/p&gt;

</description>
      <category>kubectl</category>
      <category>kubernetes</category>
      <category>rollback</category>
      <category>deployment</category>
    </item>
    <item>
      <title>How to Set Up Argo CD GitOps for Kubernetes Automation</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:21:13 +0000</pubDate>
      <link>https://dev.to/devopsstart/how-to-set-up-argo-cd-gitops-for-kubernetes-automation-1l3g</link>
      <guid>https://dev.to/devopsstart/how-to-set-up-argo-cd-gitops-for-kubernetes-automation-1l3g</guid>
      <description>&lt;p&gt;&lt;em&gt;Stop relying on manual kubectl applies and start treating your cluster as code. This comprehensive guide, originally published on devopsstart.com, walks you through setting up Argo CD for true GitOps automation.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you are still running &lt;code&gt;kubectl apply -f manifests/&lt;/code&gt; from your local machine or a Jenkins pipeline, you are operating in a "Push" model. In this model, your CI tool needs high-privileged credentials to your cluster, and you have no guarantee that what is actually running in production matches what is in your Git repository. One rogue developer running a manual &lt;code&gt;kubectl edit&lt;/code&gt; can create a "configuration drift" that haunts you for months.&lt;/p&gt;

&lt;p&gt;This is where GitOps comes in. GitOps is a paradigm where Git is the single source of truth for your infrastructure and application state. Instead of pushing changes to the cluster, a controller inside the cluster constantly monitors your Git repo and "pulls" the state to match.&lt;/p&gt;

&lt;p&gt;In this tutorial (Part 1 of our series), we will move from imperative deployments to declarative continuous delivery using Argo CD v2.11.0. You'll learn how to install Argo CD, connect your repositories, handle configuration drift, and scale your deployments using ApplicationSets. By the end, you'll have a production-ready GitOps engine that ensures your cluster is always in the desired state. For a deeper dive into how this compares to other tools, check out our guide on &lt;a href="https://dev.to/blog/argo-cd-vs-flux-a-guide-for-multi-cluster-gitops"&gt;Argo CD vs Flux for multi-cluster GitOps&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we start, you need a working Kubernetes cluster. This can be a managed service like EKS, GKE, or AKS, or a local setup like Kind or Minikube. For this tutorial, we assume you are using Kubernetes v1.30 or newer.&lt;/p&gt;

&lt;p&gt;You will need the following tools installed on your local workstation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubectl v1.30+&lt;/strong&gt;: The standard Kubernetes CLI. Ensure your context is set to the correct cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git v2.40+&lt;/strong&gt;: Required for managing the manifests that Argo CD will track.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A GitHub or GitLab account&lt;/strong&gt;: You need a repository to store your Kubernetes YAML manifests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic YAML knowledge&lt;/strong&gt;: You should know how to write a basic Deployment and Service manifest. If you are totally new to this, refer to our &lt;a href="https://dev.to/blog/kubernetes-for-beginners-deploy-your-first-application"&gt;Kubernetes for beginners guide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need to install the Argo CD CLI for the basic setup, as we will use the Web UI and &lt;code&gt;kubectl&lt;/code&gt; for most operations, but having it installed is helpful for advanced automation.&lt;/p&gt;
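&lt;p&gt;If you do want the CLI, here is a minimal install sketch for Linux, using the project's standard release asset naming for v2.11.0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Download the v2.11.0 CLI binary and place it on your PATH
curl -sSL -o argocd https://github.com/argoproj/argo-cd/releases/download/v2.11.0/argocd-linux-amd64
chmod +x argocd
sudo mv argocd /usr/local/bin/argocd
argocd version --client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;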

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;The goal of this tutorial is to build a fully automated deployment pipeline where a Git commit is the only trigger needed to update your application. We aren't just deploying a "Hello World" app; we are building a scalable architecture.&lt;/p&gt;

&lt;p&gt;Here is the architecture we will implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Manifest Repo&lt;/strong&gt;: A dedicated Git repository containing the desired state of your cluster (YAML files).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argo CD Controller&lt;/strong&gt;: Installed in the &lt;code&gt;argocd&lt;/code&gt; namespace, acting as the GitOps operator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Application CRD&lt;/strong&gt;: A custom resource that tells Argo CD: "Watch this folder in Git and make sure it exists in this namespace in the cluster."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Sync&lt;/strong&gt;: A policy that automatically corrects any manual changes (drift) made to the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ApplicationSets&lt;/strong&gt;: A template-based approach to deploy the same application across multiple namespaces (e.g., &lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, &lt;code&gt;prod&lt;/code&gt;) without duplicating YAML files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By moving to this "Pull" model, you eliminate the need to store Kubeconfigs in your CI tool (like GitHub Actions or GitLab CI), which reduces the attack surface of your infrastructure by removing high-privileged secrets from external runners.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Installing Argo CD
&lt;/h2&gt;

&lt;p&gt;Argo CD is installed as a set of deployments and services within your cluster. We will use the official manifests provided by the Argo project.&lt;/p&gt;

&lt;p&gt;First, create a dedicated namespace for Argo CD to keep the installation isolated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, apply the installation manifest. We will use the stable release manifest for v2.11.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/argoproj/argo-cd/v2.11.0/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for all pods to reach the &lt;code&gt;Running&lt;/code&gt; state. You can monitor the progress with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                                                 READY   STATUS    RESTARTS   AGE
argocd-server-7d5f8f8f5-abc12                       1/1     Running   0          2m
argocd-repo-server-5f4d7e9-def34                     1/1     Running   0          2m
argocd-application-controller-f7e8d9-ghi56           1/1     Running   0          2m
argocd-redis-7c8b9a0-jkl78                           1/1     Running   0          2m
argocd-notifications-controller-h9i0j1-mno90         1/1     Running   0          2m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any pods stay in &lt;code&gt;Pending&lt;/code&gt; or &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, you can diagnose the issue using our guide on &lt;a href="https://dev.to/troubleshooting/crashloopbackoff-kubernetes"&gt;troubleshooting CrashLoopBackOff in Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Initial Access and Authentication
&lt;/h2&gt;

&lt;p&gt;By default, the Argo CD API server is not exposed to the public internet. For this tutorial, we will use port-forwarding to access the UI.&lt;/p&gt;

&lt;p&gt;Start the port-forward in a separate terminal window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward svc/argocd-server &lt;span class="nt"&gt;-n&lt;/span&gt; argocd 8080:443
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, open your browser and go to &lt;code&gt;https://localhost:8080&lt;/code&gt;. You will see a login screen. The default username is &lt;code&gt;admin&lt;/code&gt;. The password, however, is automatically generated and stored in a Kubernetes secret.&lt;/p&gt;

&lt;p&gt;Run the following command to retrieve the initial admin password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get secret argocd-initial-admin-secret &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"{.data.password}"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output will be a plain-text string, for example: &lt;code&gt;xYz123AbC456DefG&lt;/code&gt;. Copy this password and use it to log in to the UI.&lt;/p&gt;

&lt;p&gt;Once you log in, change the admin password under the "User Info" section. For production environments, avoid using the initial admin account and instead integrate with an OIDC provider like Okta or GitHub. You can find more details in the &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/installation/" rel="noopener noreferrer"&gt;official Argo CD documentation&lt;/a&gt;.&lt;/p&gt;
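&lt;p&gt;If you installed the CLI, you can also rotate the password from the terminal. A sketch that assumes the port-forward from the previous step is still running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Log in via the local port-forward (self-signed cert, hence --insecure)
argocd login localhost:8080 --username admin --insecure
# Interactively set a new admin password
argocd account update-password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;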

&lt;h2&gt;
  
  
  Step 3: Connecting Your Git Repository
&lt;/h2&gt;

&lt;p&gt;Argo CD needs permission to read your manifests. If your repository is public, you can just provide the URL. If it is private, you need to provide SSH keys or HTTPS credentials.&lt;/p&gt;

&lt;p&gt;Let's assume you have a private GitHub repository located at &lt;code&gt;git@github.com:your-org/gitops-manifests.git&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to the Argo CD UI.&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Repositories&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Connect Repo&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;via SSH&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter the Repository URL: &lt;code&gt;git@github.com:your-org/gitops-manifests.git&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Paste your private SSH key (the one that has read access to the repo).&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Connect&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the connection is successful, the status will change to &lt;code&gt;Successful&lt;/code&gt;. If you see &lt;code&gt;Failed&lt;/code&gt;, ensure your SSH key is correct and that the Argo CD pod has outbound network access to GitHub.&lt;/p&gt;
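&lt;p&gt;If you prefer to keep this step declarative as well, Argo CD also picks up repository credentials from a Kubernetes Secret labeled as a repository. A minimal sketch; the key material is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Declarative alternative to the UI: a Secret labeled as an Argo CD repository
kubectl apply -n argocd -f - &amp;lt;&amp;lt;'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: gitops-manifests-repo
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: git@github.com:your-org/gitops-manifests.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    (placeholder)
    -----END OPENSSH PRIVATE KEY-----
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;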

&lt;h2&gt;
  
  
  Step 4: Creating Your First GitOps Application
&lt;/h2&gt;

&lt;p&gt;In Argo CD, an "Application" is a Custom Resource (CRD) that defines the link between a source (Git) and a destination (Cluster).&lt;/p&gt;

&lt;p&gt;We will create a simple application that deploys a guestbook app. First, ensure your Git repo has a folder named &lt;code&gt;guestbook&lt;/code&gt; containing a &lt;code&gt;deployment.yaml&lt;/code&gt; and a &lt;code&gt;service.yaml&lt;/code&gt;.&lt;/p&gt;
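&lt;p&gt;For reference, the manifest repository layout we assume looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gitops-manifests/
└── guestbook/
    ├── deployment.yaml
    └── service.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;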

&lt;p&gt;Using YAML is the "GitOps way" because you can store the Application definition itself in Git (the App-of-Apps pattern). Save the following as &lt;code&gt;guestbook-app.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-gitops&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;git@github.com:your-org/gitops-manifests.git'&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://kubernetes.default.svc'&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-demo&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Component Breakdown
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;source&lt;/strong&gt;: This tells Argo CD where to look. &lt;code&gt;targetRevision: HEAD&lt;/code&gt; tracks the latest commit on the default branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;destination&lt;/strong&gt;: &lt;code&gt;https://kubernetes.default.svc&lt;/code&gt; refers to the cluster where Argo CD is currently installed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;syncPolicy&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;automated&lt;/code&gt;: Argo CD will automatically apply changes from Git to the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prune: true&lt;/code&gt;: If you delete a file from Git, Argo CD will delete the corresponding resource from Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;selfHeal: true&lt;/code&gt;: If someone manually edits a resource in the cluster, Argo CD will instantly overwrite it with the Git version.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CreateNamespace=true&lt;/code&gt;: Ensures the &lt;code&gt;guestbook-demo&lt;/code&gt; namespace is created if it doesn't exist.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Apply this manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; guestbook-app.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, go to the Argo CD UI. You will see the &lt;code&gt;guestbook-gitops&lt;/code&gt; application. Initially, it will be "OutOfSync" while it calculates the difference, then it will transition to "Synced" and "Healthy" as it creates the pods and services.&lt;/p&gt;
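&lt;p&gt;You can watch the same transition from the terminal, since Applications are ordinary custom resources. A quick sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The SYNC STATUS and HEALTH STATUS columns mirror the UI badges
kubectl get application guestbook-gitops -n argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;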

&lt;h2&gt;
  
  
  Step 5: Handling Configuration Drift
&lt;/h2&gt;

&lt;p&gt;Configuration drift occurs when the actual state of the cluster deviates from the desired state defined in Git. This usually happens when a developer uses &lt;code&gt;kubectl edit&lt;/code&gt; to fix a production bug quickly but forgets to update the Git repo.&lt;/p&gt;

&lt;p&gt;Let's simulate this. We will manually scale our deployment to 5 replicas using the CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl scale deployment guestbook &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="nt"&gt;-n&lt;/span&gt; guestbook-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you check the Argo CD UI now, the application status has changed from &lt;code&gt;Synced&lt;/code&gt; to &lt;code&gt;OutOfSync&lt;/code&gt;. The UI will highlight the exact difference in yellow: "Desired: 3 replicas, Actual: 5 replicas."&lt;/p&gt;

&lt;p&gt;Because we enabled &lt;code&gt;selfHeal: true&lt;/code&gt;, you won't have to do anything. Within a few seconds, Argo CD will detect the drift and automatically scale the deployment back down to 3 replicas to match Git.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;selfHeal&lt;/code&gt; were disabled, the application would stay &lt;code&gt;OutOfSync&lt;/code&gt;. You would then have two choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sync&lt;/strong&gt;: Click the "Sync" button in the UI to force the cluster to match Git.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update Git&lt;/strong&gt;: Change the replica count in your Git repo to 5, commit, and push. Argo CD would then see the new desired state and update the cluster.&lt;/li&gt;
&lt;/ol&gt;
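&lt;p&gt;Either way, if you have the Argo CD CLI installed and logged in, you can inspect the exact drift from the terminal before deciding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show the live-vs-Git diff for the application
argocd app diff guestbook-gitops
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;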

&lt;p&gt;This mechanism ensures that your Git history is an audit log of every change ever made to your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Scaling with ApplicationSets
&lt;/h2&gt;

&lt;p&gt;Creating one &lt;code&gt;Application&lt;/code&gt; manifest for one app is easy. But if you have 50 microservices across &lt;code&gt;dev&lt;/code&gt;, &lt;code&gt;staging&lt;/code&gt;, and &lt;code&gt;prod&lt;/code&gt; clusters, creating 150 &lt;code&gt;Application&lt;/code&gt; manifests is a maintenance nightmare.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ApplicationSets&lt;/code&gt; allow you to use a template to generate multiple Applications automatically. They use "generators" to discover targets.&lt;/p&gt;

&lt;p&gt;Let's use a &lt;strong&gt;List Generator&lt;/strong&gt; to deploy the same guestbook app into three different namespaces. Save this as &lt;code&gt;guestbook-appset.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-environments&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;engineering-dev&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-dev&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;engineering-staging&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-staging&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;engineering-prod&lt;/span&gt;
            &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-prod&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}-guestbook'&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;git@github.com:your-org/gitops-manifests.git'&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://kubernetes.default.svc'&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{namespace}}'&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the ApplicationSet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; guestbook-appset.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Argo CD iterates through the list and creates three separate &lt;code&gt;Application&lt;/code&gt; resources: &lt;code&gt;engineering-dev-guestbook&lt;/code&gt;, &lt;code&gt;engineering-staging-guestbook&lt;/code&gt;, and &lt;code&gt;engineering-prod-guestbook&lt;/code&gt;.&lt;/p&gt;
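&lt;p&gt;You can confirm the generated resources with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Each list element becomes its own Application resource
kubectl get applications -n argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;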

&lt;p&gt;If you need to add a new environment (e.g., &lt;code&gt;qa&lt;/code&gt;), you simply add one line to the &lt;code&gt;elements&lt;/code&gt; list in the ApplicationSet and commit. For teams managing huge fleets of clusters, the &lt;strong&gt;Cluster Generator&lt;/strong&gt; is even more powerful; it can automatically detect every cluster registered in Argo CD and deploy a "base" set of tools to all of them without any manual listing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 7: Integrating the Full CI/CD Pipeline
&lt;/h2&gt;

&lt;p&gt;Now that the "CD" (Continuous Delivery) part is handled by Argo CD, how does the "CI" (Continuous Integration) part fit in?&lt;/p&gt;

&lt;p&gt;A common mistake is letting the CI tool (GitHub Actions, Jenkins, CircleCI) call &lt;code&gt;kubectl apply&lt;/code&gt;. This breaks the GitOps model. Instead, the CI tool should only be responsible for updating the manifest repository.&lt;/p&gt;

&lt;p&gt;Here is the professional workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Developer Pushes Code&lt;/strong&gt;: A developer pushes a change to the application source code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI Pipeline Runs&lt;/strong&gt;: GitHub Actions triggers a build, runs tests, and builds a Docker image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Push&lt;/strong&gt;: The CI tool pushes the image to a registry (e.g., Amazon ECR) with a unique tag (the Git SHA).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifest Update&lt;/strong&gt;: The CI tool clones the &lt;code&gt;gitops-manifests&lt;/code&gt; repo and updates the image tag in the &lt;code&gt;deployment.yaml&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git Commit&lt;/strong&gt;: The CI tool commits and pushes the change back to the manifest repo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argo CD Pull&lt;/strong&gt;: Argo CD detects the commit in the manifest repo and pulls the change into the cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Example GitHub Action snippet for manifest update
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Kubernetes image tag&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;git clone https://x-token-auth:${{ secrets.GITOPS_TOKEN }}@github.com/your-org/gitops-manifests.git&lt;/span&gt;
    &lt;span class="s"&gt;cd gitops-manifests&lt;/span&gt;
    &lt;span class="s"&gt;sed -i "s|image: my-app:.*|image: my-app:${{ github.sha }}|g" guestbook/deployment.yaml&lt;/span&gt;
    &lt;span class="s"&gt;git config user.name "GitHub Action"&lt;/span&gt;
    &lt;span class="s"&gt;git config user.email "action@github.com"&lt;/span&gt;
    &lt;span class="s"&gt;git add .&lt;/span&gt;
    &lt;span class="s"&gt;git commit -m "Update guestbook image to ${{ github.sha }}"&lt;/span&gt;
    &lt;span class="s"&gt;git push&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This separation of concerns is critical. The CI tool has no access to the cluster; it only has access to a Git repository. If your CI tool is compromised, the attacker cannot delete your production pods; they can only propose changes to Git, which can be blocked by a Pull Request review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Application stuck in "Progressing" status
&lt;/h3&gt;

&lt;p&gt;If your application is "Synced" but stays in "Progressing", the pods are likely failing to start. Argo CD is waiting for the Kubernetes health check to return &lt;code&gt;Healthy&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Check the pod events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; guestbook-demo
kubectl describe pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; guestbook-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for &lt;code&gt;ImagePullBackOff&lt;/code&gt; or &lt;code&gt;CrashLoopBackOff&lt;/code&gt;. If you see the latter, use the tips in our &lt;a href="https://dev.to/tips/debug-crashloopbackoff"&gt;CrashLoopBackOff debugging guide&lt;/a&gt; to find the root cause.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Repository Connection Failed
&lt;/h3&gt;

&lt;p&gt;If Argo CD cannot connect to your Git repo, check the &lt;code&gt;argocd-repo-server&lt;/code&gt; logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;argocd-repo-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common causes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorrect SSH private key.&lt;/li&gt;
&lt;li&gt;Firewall rules blocking port 22 (SSH) or 443 (HTTPS) from the cluster to GitHub.&lt;/li&gt;
&lt;li&gt;Using a GitHub Deploy Key that doesn't have read access to the repository.&lt;/li&gt;
&lt;/ul&gt;
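&lt;p&gt;To rule out network problems specifically, you can run a throwaway debug pod and test outbound connectivity from inside the cluster. A sketch using the third-party &lt;code&gt;nicolaka/netshoot&lt;/code&gt; image, which bundles common network tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Check SSH (22) and HTTPS (443) reachability to GitHub from the argocd namespace
kubectl run net-test -n argocd --rm -it --restart=Never --image=nicolaka/netshoot -- sh -c 'nc -zv github.com 22; nc -zv github.com 443'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;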

&lt;h3&gt;
  
  
  3. "OutOfSync" loop
&lt;/h3&gt;

&lt;p&gt;Sometimes an application flips between &lt;code&gt;Synced&lt;/code&gt; and &lt;code&gt;OutOfSync&lt;/code&gt; rapidly. This is often caused by a conflict between Argo CD and another controller (like a Horizontal Pod Autoscaler).&lt;/p&gt;

&lt;p&gt;If HPA is changing the replica count and Argo CD is trying to force it back to the Git value, they will fight forever. To fix this, ignore the &lt;code&gt;replicas&lt;/code&gt; field in the Application spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
      &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/spec/replicas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Can Argo CD manage Helm charts?&lt;/strong&gt;&lt;br&gt;
A: Yes. Argo CD natively supports Helm. You can point the &lt;code&gt;source&lt;/code&gt; to a Helm repository or a folder containing a &lt;code&gt;Chart.yaml&lt;/code&gt;. You can also provide a &lt;code&gt;values.yaml&lt;/code&gt; file in Git to override default settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What is the "App-of-Apps" pattern?&lt;/strong&gt;&lt;br&gt;
A: This is a pattern where you create one "Root" Argo CD Application that points to a folder containing other Application manifests. This allows you to manage your entire cluster state (including other apps) using a single GitOps entry point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Does Argo CD support multi-cluster management?&lt;/strong&gt;&lt;br&gt;
A: Yes. You can add external clusters to Argo CD via the CLI or UI. Once added, you can set the &lt;code&gt;destination.server&lt;/code&gt; in your Application manifest to the API server URL of the remote cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;You have now transitioned from manual &lt;code&gt;kubectl&lt;/code&gt; deployments to a professional GitOps workflow. We have installed Argo CD v2.11.0, connected a private repository, and deployed an application using the declarative model. By implementing &lt;code&gt;selfHeal&lt;/code&gt;, you've ensured that your cluster is resilient to manual configuration drift. Furthermore, by using &lt;code&gt;ApplicationSets&lt;/code&gt;, you've built a foundation that can scale from one application to hundreds across multiple environments.&lt;/p&gt;

&lt;p&gt;The key takeaway is the shift in trust. You no longer trust the state of the cluster; you trust the state of Git. This makes your deployments repeatable, auditable, and significantly more secure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Actionable next steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Migrate one existing production service to Argo CD.&lt;/li&gt;
&lt;li&gt;Set up a "Management" repo that uses the App-of-Apps pattern to manage all your other Application manifests.&lt;/li&gt;
&lt;li&gt;Implement a PR-based workflow where no one is allowed to push directly to the main branch of your manifest repo.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In Part 2 of this series, we will cover advanced Argo CD features, including Blue/Green and Canary deployments using Argo Rollouts, and how to integrate Prometheus metrics to trigger automatic rollbacks if a new release increases your error rate.&lt;/p&gt;

</description>
      <category>argocd</category>
      <category>gitopsworkflow</category>
      <category>kubernetescd</category>
      <category>argocdapplicationsets</category>
    </item>
    <item>
      <title>How to Configure Advanced Argo CD Sync Policies for GitOps</title>
      <dc:creator>DevOps Start</dc:creator>
      <pubDate>Tue, 14 Apr 2026 15:16:09 +0000</pubDate>
      <link>https://dev.to/devopsstart/how-to-configure-advanced-argo-cd-sync-policies-for-gitops-2c92</link>
      <guid>https://dev.to/devopsstart/how-to-configure-advanced-argo-cd-sync-policies-for-gitops-2c92</guid>
      <description>&lt;p&gt;&lt;em&gt;Want to move beyond basic GitOps? I've put together a deep dive on mastering Argo CD sync policies, originally published on devopsstart.com.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving into advanced sync policies, you need a functioning Kubernetes cluster and a baseline Argo CD installation. This tutorial assumes you've already moved past the "Hello World" phase of GitOps. If you haven't set up your initial environment yet, follow our guide on &lt;a href="https://dev.to/devopsstart/how-to-set-up-argo-cd-gitops-for-kubernetes-automation-1l3g"&gt;setting up Argo CD GitOps for Kubernetes automation&lt;/a&gt; to get the controller running.&lt;/p&gt;

&lt;p&gt;To follow the examples in this guide, ensure the following tools are installed on your local machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Cluster&lt;/strong&gt;: v1.28 or newer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubectl&lt;/strong&gt;: v1.28 or newer, configured to communicate with your cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Argo CD CLI&lt;/strong&gt;: v2.10.0 or newer. This is essential for performing manual rollbacks and interacting with the API without the GUI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git&lt;/strong&gt;: A repository (GitHub, GitLab or Bitbucket) containing your Kubernetes manifests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A Sample Application&lt;/strong&gt;: A deployment consisting of at least one Deployment, one Service and one ConfigMap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should have a basic understanding of the Application CRD (Custom Resource Definition) and how Argo CD tracks the state between your Git repository (the desired state) and your cluster (the live state). If you are unsure how to structure your Git folders, refer to the &lt;a href="https://dev.to/devopsstart/how-to-set-up-argo-cd-gitops-for-kubernetes-automation-1l3g"&gt;same setup guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Most teams start with Argo CD using "Manual Sync." You push code to Git, see the "Out of Sync" yellow badge in the UI and click the "Sync" button. While this feels safe, it's not production-grade. In large-scale environments, manual syncing creates a bottleneck and leads to configuration drift, where the cluster state diverges from Git for hours because a manual trigger was missed.&lt;/p&gt;

&lt;p&gt;Simply turning on "Automatic Sync" can be dangerous. By default, Argo CD ensures that what is in Git is present in the cluster, but it won't necessarily remove what is &lt;em&gt;not&lt;/em&gt; in Git. This leads to orphaned resources (leftover services or secrets) that can cause naming conflicts or security holes.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will build a production-ready synchronization strategy. You will learn how to implement automated pruning to keep your cluster clean, configure self-healing to prevent manual "hot-fixes" from persisting and manage complex deployment orders using sync waves. We will also tackle the "Day 2" problem of rollbacks: deciding when to revert a Git commit versus using the Argo CD rollback feature.&lt;/p&gt;

&lt;p&gt;By the end of this guide, you'll have a robust GitOps pipeline that handles infrastructure lifecycle management automatically, reduces human error during deployments and provides a clear path for disaster recovery. You can find more details on the core architecture in the official &lt;a href="https://argo-cd.readthedocs.io/" rel="noopener noreferrer"&gt;Argo CD Documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Implementing Automated Pruning and Self-Healing
&lt;/h2&gt;

&lt;p&gt;The first step toward production-grade GitOps is eliminating manual intervention. Many operators avoid &lt;code&gt;prune: true&lt;/code&gt; fearing the accidental deletion of production resources. However, without pruning, your cluster becomes a graveyard of old ConfigMaps and abandoned Services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Pruning and Self-Healing
&lt;/h3&gt;

&lt;p&gt;Pruning is the process where Argo CD identifies resources that exist in the cluster (and are managed by the app) but are no longer present in the Git repository. If pruning is disabled, deleting a file in Git does nothing to the cluster.&lt;/p&gt;

&lt;p&gt;Self-healing goes a step further. If a developer uses &lt;code&gt;kubectl edit&lt;/code&gt; to change a replica count or an environment variable directly in the cluster, Argo CD detects the drift and immediately overwrites those changes with the state defined in Git.&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuration
&lt;/h3&gt;

&lt;p&gt;To enable these, modify the &lt;code&gt;syncPolicy&lt;/code&gt; section of your Application manifest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook-production&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://github.com/argoproj/argocd-example-apps.git'&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://kubernetes.default.svc'&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply this configuration using &lt;code&gt;kubectl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; application.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Testing the Policy
&lt;/h3&gt;

&lt;p&gt;To verify pruning, delete a resource from your Git repository (for example, a Service manifest) and push the change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git &lt;span class="nb"&gt;rm &lt;/span&gt;manifests/service.yaml
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Remove legacy service"&lt;/span&gt;
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for the next sync cycle (usually 3 minutes by default, or instantly if you have a webhook configured), then check your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; guestbook
&lt;span class="c"&gt;# Expected output: Error from server (NotFound)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, test self-healing. Try to manually scale your deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl scale deployment guestbook-ui &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10 &lt;span class="nt"&gt;-n&lt;/span&gt; guestbook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run &lt;code&gt;kubectl get pods -n guestbook&lt;/code&gt;. You'll notice the pods scale up for a moment, but within 60 to 120 seconds, Argo CD will detect the drift and scale them back down to the number specified in Git.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Mastering Advanced Sync Options
&lt;/h2&gt;

&lt;p&gt;Standard syncing works for 90% of resources, but Kubernetes has immutable fields. For example, if you try to change the &lt;code&gt;selector&lt;/code&gt; of a Service or certain fields in a Job, the Kubernetes API rejects the update with a &lt;code&gt;422 Unprocessable Entity&lt;/code&gt; error. Argo CD will remain in a "Sync Failed" state indefinitely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Replace=true
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Replace=true&lt;/code&gt; option tells Argo CD to use &lt;code&gt;kubectl replace&lt;/code&gt; (or &lt;code&gt;kubectl create&lt;/code&gt;) instead of &lt;code&gt;kubectl apply&lt;/code&gt;, swapping the whole object rather than patching it. For fields the API server treats as truly immutable, you may additionally need &lt;code&gt;Force=true&lt;/code&gt;, which deletes and recreates the resource; use it carefully, since the resource briefly disappears during the sync.&lt;/p&gt;

&lt;p&gt;Add this to your &lt;code&gt;syncOptions&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Replace=true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;SkipDryRunOnMissingResource=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;SkipDryRunOnMissingResource=true&lt;/code&gt; is particularly useful when dealing with complex CRDs. Sometimes the dry-run validation fails because a dependent resource doesn't exist yet, even though the actual application would succeed.&lt;/p&gt;

&lt;h3&gt;
  
  
  ApplicationSet Level Policies
&lt;/h3&gt;

&lt;p&gt;If you manage 50 clusters using an &lt;code&gt;ApplicationSet&lt;/code&gt;, you don't want to define these policies 50 times. Define the &lt;code&gt;syncPolicy&lt;/code&gt; within the &lt;code&gt;template&lt;/code&gt; section of the ApplicationSet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ApplicationSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cluster-config&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;generators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;elements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;engineering-dev&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;engineering-prod&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://prod-cluster.example.com&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{cluster}}-guestbook'&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://github.com/argoproj/argocd-example-apps.git'&lt;/span&gt;
        &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;
      &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{{url}}'&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;guestbook&lt;/span&gt;
      &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Implementing Sync Waves and Hooks
&lt;/h2&gt;

&lt;p&gt;In production, you cannot deploy everything simultaneously. You might need a database schema migration to finish before the API server starts, or a smoke test to pass before the LoadBalancer switches traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sync Waves
&lt;/h3&gt;

&lt;p&gt;Sync waves allow you to assign an order to resources. Argo CD applies resources in increasing order of their wave number. Resources with the same wave are applied concurrently.&lt;/p&gt;

&lt;p&gt;Add the annotation &lt;code&gt;argocd.argoproj.io/sync-wave&lt;/code&gt; to your manifests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database Migration (Wave 1):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-migration&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migrate&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migration-tool:v1.2.0&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Application Deployment (Wave 2):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-server&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Deployment spec here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cache Warmup (Wave 3):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cache-warmup&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Job spec here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Argo CD waits for the Wave 1 Job to reach a "Healthy" state before attempting to create the Wave 2 Deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sync Hooks
&lt;/h3&gt;

&lt;p&gt;Hooks are used for transient tasks rather than permanent resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PreSync&lt;/strong&gt;: Runs before the sync starts. Ideal for database backups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sync&lt;/strong&gt;: Runs once all PreSync hooks succeed, alongside the regular manifest application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostSync&lt;/strong&gt;: Runs after the sync completes and all resources report Healthy. Ideal for notifications or integration tests (a sketch follows the PreSync example below).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example of a PreSync backup hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pre-sync-backup&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PreSync&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/hook-delete-policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HookSucceeded&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup-util:latest&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;HookSucceeded&lt;/code&gt; policy ensures the Job object is deleted from the cluster once it completes successfully, preventing the buildup of thousands of finished Job objects.&lt;/p&gt;
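&lt;p&gt;As a counterpart to the backup hook, here is a minimal &lt;code&gt;PostSync&lt;/code&gt; smoke-test sketch; the &lt;code&gt;smoke-test:latest&lt;/code&gt; image and the target URL are hypothetical placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    # Keep failed runs around for debugging; delete successful ones
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: smoke-test
          image: smoke-test:latest   # hypothetical test image
          args: ["--target", "http://api-server"]
      restartPolicy: Never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;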

&lt;h2&gt;
  
  
  Step 4: Designing Robust Rollback Strategies
&lt;/h2&gt;

&lt;p&gt;When a production deployment fails, the pressure to recover "right now" often leads to a conflict between "Pure GitOps" and "Fast Recovery."&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategy A: The Git-based Rollback (Pure GitOps)
&lt;/h3&gt;

&lt;p&gt;In this approach, you never use the Argo CD UI for rollbacks. You use &lt;code&gt;git revert&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Perfect audit trail.&lt;/li&gt;
&lt;li&gt;Zero drift between Git and Cluster.&lt;/li&gt;
&lt;li&gt;Works across multiple clusters simultaneously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slower recovery time. You must commit, push and wait for the sync cycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Execution&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git log &lt;span class="nt"&gt;--oneline&lt;/span&gt;
&lt;span class="c"&gt;# a1b2c3d (HEAD) Update image to v2.1.0 (BROKEN)&lt;/span&gt;
&lt;span class="c"&gt;# e5f6g7h Update image to v2.0.0 (STABLE)&lt;/span&gt;

git revert a1b2c3d
git push origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Strategy B: The Argo CD UI/CLI Rollback (Emergency Fast-Track)
&lt;/h3&gt;

&lt;p&gt;Argo CD allows you to roll back to a previous successful revision of the application. This is an immediate operation that bypasses Git.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execution using CLI&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;argocd app rollback guestbook-production 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
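&lt;p&gt;The revision ID passed to &lt;code&gt;rollback&lt;/code&gt; (12 above) comes from the application's deployment history, which you can list first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List past deployments with their IDs, revisions and timestamps
argocd app history guestbook-production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;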



&lt;p&gt;&lt;strong&gt;The Danger Zone&lt;/strong&gt;: This operation conflicts with &lt;code&gt;automated: selfHeal: true&lt;/code&gt;. Recent Argo CD versions refuse a rollback while auto-sync is enabled, and even if you reverted the cluster by other means, Argo CD would see it running v2.0.0 while Git still says v2.1.0. Because self-healing is on, it would "fix" the cluster by re-deploying the broken v2.1.0. Always disable auto-sync first.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision Matrix
&lt;/h3&gt;

&lt;p&gt;Follow these rules for professional rollback management:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;For Non-Critical Bugs&lt;/strong&gt;: Use &lt;code&gt;git revert&lt;/code&gt;. It is the only way to ensure the environment remains reproducible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Critical Outages (P0)&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Step 1: Disable Auto-Sync in the UI or CLI.&lt;/li&gt;
&lt;li&gt;Step 2: Perform the Argo CD rollback to a known good revision.&lt;/li&gt;
&lt;li&gt;Step 3: Fix the root cause in Git (a revert or a forward patch).&lt;/li&gt;
&lt;li&gt;Step 4: Verify that Git now describes the desired, fixed state.&lt;/li&gt;
&lt;li&gt;Step 5: Re-enable Auto-Sync (the full sequence is sketched as CLI commands below).&lt;/li&gt;
&lt;/ul&gt;
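&lt;p&gt;A condensed sketch of that sequence on the CLI, reusing the hypothetical application name, history ID and commit hash from the earlier examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Step 1: pause automation so self-heal cannot fight the rollback
argocd app set guestbook-production --sync-policy manual

# Step 2: roll back to the last known-good history ID
argocd app rollback guestbook-production 12

# Steps 3-4: fix the root cause in Git
git revert a1b2c3d
git push origin main

# Step 5: restore automation once Git matches the desired state
argocd app set guestbook-production --sync-policy automated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;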

&lt;p&gt;If you encounter constant Pod failures during these transitions, you might be facing a CrashLoopBackOff scenario (see /troubleshooting/crashloopbackoff-kubernetes), which requires log analysis before deciding on a rollback strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Implementing Custom Health Checks
&lt;/h2&gt;

&lt;p&gt;Argo CD knows how to check the health of standard resources. However, if you use Custom Resource Definitions (CRDs) from an operator (like Prometheus or Istio), Argo CD only knows if the resource was created. It doesn't know if the operator actually succeeded in deploying the underlying components.&lt;/p&gt;

&lt;p&gt;This means a Sync Wave might move to Wave 2 even if the Wave 1 CRD is still in a "Pending" or "Error" state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Defining a Lua Health Check
&lt;/h3&gt;

&lt;p&gt;Argo CD allows you to define health checks using Lua scripts in the &lt;code&gt;argocd-cm&lt;/code&gt; ConfigMap in the &lt;code&gt;argocd&lt;/code&gt; namespace.&lt;/p&gt;

&lt;p&gt;Assume you have a custom resource called &lt;code&gt;DatabaseInstance&lt;/code&gt; that has a &lt;code&gt;status.phase&lt;/code&gt; field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd-cm&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resource.customizations.health.apps.example.com/DatabaseInstance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;hs = {}&lt;/span&gt;
    &lt;span class="s"&gt;if obj.status ~= nil then&lt;/span&gt;
      &lt;span class="s"&gt;if obj.status.phase == 'Ready' then&lt;/span&gt;
        &lt;span class="s"&gt;hs.status = 'Healthy'&lt;/span&gt;
        &lt;span class="s"&gt;hs.message = 'Database is ready'&lt;/span&gt;
        &lt;span class="s"&gt;return hs&lt;/span&gt;
      &lt;span class="s"&gt;end&lt;/span&gt;
      &lt;span class="s"&gt;if obj.status.phase == 'Failed' then&lt;/span&gt;
        &lt;span class="s"&gt;hs.status = 'Degraded'&lt;/span&gt;
        &lt;span class="s"&gt;hs.message = 'Database failed to provision'&lt;/span&gt;
        &lt;span class="s"&gt;return hs&lt;/span&gt;
      &lt;span class="s"&gt;end&lt;/span&gt;
    &lt;span class="s"&gt;end&lt;/span&gt;
    &lt;span class="s"&gt;hs.status = 'Progressing'&lt;/span&gt;
    &lt;span class="s"&gt;hs.message = 'Waiting for database to be ready'&lt;/span&gt;
    &lt;span class="s"&gt;return hs&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the change and restart the &lt;code&gt;argocd-application-controller&lt;/code&gt; (a StatefulSet in standard installs) so it picks up the new check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd-cm.yaml
kubectl rollout restart statefulset argocd-application-controller &lt;span class="nt"&gt;-n&lt;/span&gt; argocd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, Argo CD will wait for the &lt;code&gt;DatabaseInstance&lt;/code&gt; to reach the &lt;code&gt;Ready&lt;/code&gt; phase before marking the resource as Healthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Managing Sync Windows and Maintenance Periods
&lt;/h2&gt;

&lt;p&gt;In enterprise environments, automated deployments are often prohibited during "Freeze Periods" (e.g., Black Friday). You still want GitOps to track changes, but you don't want them applied to the cluster.&lt;/p&gt;

&lt;p&gt;Argo CD ships a native mechanism for this: sync windows, defined on the AppProject resource. For ad-hoc freezes you can also fall back to labels and simple automation, covered afterwards.&lt;/p&gt;
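&lt;p&gt;A minimal sketch of a native deny window, assuming your applications belong to a (hypothetical) project named &lt;code&gt;production&lt;/code&gt;; this blocks automated syncs for five days starting at midnight on November 25:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  syncWindows:
    - kind: deny                # block syncs during the window
      schedule: '0 0 25 11 *'   # cron: starts at 00:00 on Nov 25
      duration: 120h            # window stays open for 5 days
      applications:
        - '*'                   # applies to every app in the project
      manualSync: true          # still allow explicit manual syncs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;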

&lt;h3&gt;
  
  
  The Label-Based Freeze Approach
&lt;/h3&gt;

&lt;p&gt;Add a label &lt;code&gt;sync-window: frozen&lt;/code&gt; to your application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl label app guestbook-production sync-window&lt;span class="o"&gt;=&lt;/span&gt;frozen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a simple automation (via GitHub Action or CronJob) that toggles the &lt;code&gt;automated&lt;/code&gt; sync policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Freeze" Script:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable auto-sync during freeze&lt;/span&gt;
argocd app &lt;span class="nb"&gt;set &lt;/span&gt;guestbook-production &lt;span class="nt"&gt;--sync-policy&lt;/span&gt; manual
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The "Unfreeze" Script:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable auto-sync after freeze&lt;/span&gt;
argocd app &lt;span class="nb"&gt;set &lt;/span&gt;guestbook-production &lt;span class="nt"&gt;--sync-policy&lt;/span&gt; automated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
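&lt;p&gt;To run these toggles on a schedule from inside the cluster, a CronJob can invoke the same CLI. A minimal sketch, assuming a pre-created &lt;code&gt;argocd-token&lt;/code&gt; Secret holding an API token (the Secret name and image tag are hypothetical, and your TLS setup may require extra flags such as &lt;code&gt;--grpc-web&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: freeze-guestbook
  namespace: argocd
spec:
  schedule: "0 0 25 11 *"   # start of the freeze window
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: freeze
              image: quay.io/argoproj/argocd:v2.9.3   # ships the argocd CLI
              command: ["argocd", "app", "set", "guestbook-production", "--sync-policy", "manual"]
              env:
                - name: ARGOCD_SERVER
                  value: argocd-server.argocd.svc.cluster.local
                - name: ARGOCD_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: argocd-token   # hypothetical Secret
                      key: token
          restartPolicy: OnFailure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;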



&lt;p&gt;For a more durable approach, prefer the native AppProject sync windows shown above; alternatively, an external controller can watch for these labels and modify the Application spec. Either way, the cluster remains untouched until the window opens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Issue 1: Resource "Flickering" (The Sync Loop)
&lt;/h3&gt;

&lt;p&gt;A resource constantly switches between "Synced" and "Out of Sync." This typically happens when a controller (like an HPA or Service Mesh) modifies the resource after Argo CD applies it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;: Use &lt;code&gt;ignoreDifferences&lt;/code&gt; to tell Argo CD to ignore fields managed by other controllers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ignoreDifferences&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
      &lt;span class="na"&gt;jsonPointers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/spec/replicas&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue 2: Pruning Deleted Critical Resources
&lt;/h3&gt;

&lt;p&gt;You accidentally deleted a namespace or a critical Secret in Git, and Argo CD pruned it from the cluster, causing an outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;: Use the &lt;code&gt;prune&lt;/code&gt; safety override. Annotate specific resources to prevent them from being pruned, regardless of the application-level policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prune=false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Issue 3: Sync Wave Hanging
&lt;/h3&gt;

&lt;p&gt;A sync wave is stuck in "Progressing" and refuses to move to the next wave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix&lt;/strong&gt;: Check the health of the resource in the current wave. If it's a Job, ensure it is actually completing. If you implemented a custom health check, ensure the Lua script isn't returning &lt;code&gt;Progressing&lt;/code&gt; indefinitely due to a typo in the status field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe job db-migration &lt;span class="nt"&gt;-n&lt;/span&gt; guestbook
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Does &lt;code&gt;prune: true&lt;/code&gt; delete resources in namespaces not managed by Argo CD?&lt;/strong&gt;&lt;br&gt;
A: No. Argo CD only prunes resources that it tracks as belonging to that specific Application; resources outside its scope are never touched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Can I have different sync waves for different clusters?&lt;/strong&gt;&lt;br&gt;
A: Yes. Since sync waves are defined as annotations on the manifests themselves, you can use Kustomize or Helm to apply different annotations based on the target environment (e.g., a longer warmup wave in production than in dev).&lt;/p&gt;
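&lt;p&gt;For example, a Kustomize overlay could move the cache-warmup Job to a later wave only in production; the overlay layout here is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# overlays/production/kustomization.yaml (hypothetical layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Job
      name: cache-warmup
    patch: |-
      # '/' in the annotation key becomes '~1' per JSON Pointer escaping
      - op: replace
        path: /metadata/annotations/argocd.argoproj.io~1sync-wave
        value: "5"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;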

&lt;p&gt;&lt;strong&gt;Q: What happens if a Sync Hook fails?&lt;/strong&gt;&lt;br&gt;
A: If a &lt;code&gt;PreSync&lt;/code&gt; hook fails, Argo CD will stop the sync process and mark the application as "Degraded," preventing the deployment of potentially broken code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Is &lt;code&gt;Replace=true&lt;/code&gt; safe for all resources?&lt;/strong&gt;&lt;br&gt;
A: Not always. Since it deletes and recreates the resource, any fields not defined in your Git manifest (like some dynamically assigned annotations or labels) will be lost. Use it only for resources with immutable fields.&lt;/p&gt;
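&lt;p&gt;For reference, the per-resource form of this option follows the same &lt;code&gt;sync-options&lt;/code&gt; annotation pattern used for pruning above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metadata:
  annotations:
    # Delete-and-recreate this one resource instead of patching it
    argocd.argoproj.io/sync-options: Replace=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;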

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Moving from manual synchronization to advanced sync policies separates a "demo" GitOps setup from a production-grade platform. By implementing automated pruning and self-healing, you eliminate configuration drift and ensure Git is the absolute source of truth. Sync waves and hooks bring the orchestration capabilities of traditional CI/CD pipelines into the declarative world of Kubernetes.&lt;/p&gt;

&lt;p&gt;In this tutorial, we've covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enabling &lt;code&gt;prune&lt;/code&gt; and &lt;code&gt;selfHeal&lt;/code&gt; to maintain cluster hygiene.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;Replace=true&lt;/code&gt; to handle immutable Kubernetes fields.&lt;/li&gt;
&lt;li&gt;Orchestrating complex deployments with Sync Waves and Hooks.&lt;/li&gt;
&lt;li&gt;The critical distinction between Git reverts and Argo CD rollbacks.&lt;/li&gt;
&lt;li&gt;Extending Argo CD's intelligence with custom Lua health checks.&lt;/li&gt;
&lt;li&gt;Managing deployment freezes using sync windows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your next steps should be to audit your current Application manifests. Identify resources managed by other controllers and apply &lt;code&gt;ignoreDifferences&lt;/code&gt; to stop sync flickering. Then, map out your application dependencies and assign sync waves to ensure your databases always precede your APIs.&lt;/p&gt;

&lt;p&gt;This concludes our deep dive into Argo CD and our series on GitOps automation. For those looking to deepen their site reliability expertise, our guide at /interview/senior-sre-interview-questions-answers-for-2026 provides insight into how these patterns are evaluated in professional settings.&lt;/p&gt;

</description>
      <category>argocd</category>
      <category>gitopsautomation</category>
      <category>kubernetesdeployment</category>
      <category>syncwaves</category>
    </item>
  </channel>
</rss>
