DevOps Start

Posted on Apr 14 • Originally published at devopsstart.com

How to Configure Advanced Argo CD Sync Policies for GitOps

#argocd #gitopsautomation #kubernetesdeployment #syncwaves

Want to move beyond basic GitOps? I've put together a deep dive on mastering Argo CD sync policies, originally published on devopsstart.com.

Prerequisites

Before diving into advanced sync policies, you need a functioning Kubernetes cluster and a baseline Argo CD installation. This tutorial assumes you've already moved past the "Hello World" phase of GitOps. If you haven't set up your initial environment yet, follow the guide on /tutorials/how-to-set-up-argo-cd-gitops-for-kubernetes-automation to get the controller running.

To follow the examples in this guide, ensure the following tools are installed on your local machine:

Kubernetes Cluster: v1.28 or newer.
kubectl: v1.28 or newer, configured to communicate with your cluster.
Argo CD CLI: v2.10.0 or newer. This is essential for performing manual rollbacks and interacting with the API without the GUI.
Git: A repository (GitHub, GitLab or Bitbucket) containing your Kubernetes manifests.
A Sample Application: A deployment consisting of at least one Deployment, one Service and one ConfigMap.

You should have a basic understanding of the Application CRD (Custom Resource Definition) and how Argo CD tracks the state between your Git repository (the desired state) and your cluster (the live state). If you are unsure how to structure your Git folders, refer to /tutorials/how-to-set-up-argo-cd-gitops-for-kubernetes-automation.

Overview

Most teams start with Argo CD using "Manual Sync." You push code to Git, see the "Out of Sync" yellow badge in the UI and click the "Sync" button. While this feels safe, it's not production-grade. In large-scale environments, manual syncing creates a bottleneck and leads to configuration drift, where the cluster state diverges from Git for hours because a manual trigger was missed.

Simply turning on "Automatic Sync" can be dangerous. By default, Argo CD ensures that what is in Git is present in the cluster, but it won't necessarily remove what is not in Git. This leads to orphaned resources (leftover services or secrets) that can cause naming conflicts or security holes.

In this tutorial, we will build a production-ready synchronization strategy. You will learn how to implement automated pruning to keep your cluster clean, configure self-healing to prevent manual "hot-fixes" from persisting and manage complex deployment orders using sync waves. We will also tackle the "Day 2" problem of rollbacks: deciding when to revert a Git commit versus using the Argo CD rollback feature.

By the end of this guide, you'll have a robust GitOps pipeline that handles infrastructure lifecycle management automatically, reduces human error during deployments and provides a clear path for disaster recovery. You can find more details on the core architecture in the official Argo CD Documentation.

Step 1: Implementing Automated Pruning and Self-Healing

The first step toward production-grade GitOps is eliminating manual intervention. Many operators avoid prune: true for fear of accidentally deleting production resources. However, without pruning, your cluster becomes a graveyard of old ConfigMaps and abandoned Services.

Understanding Pruning and Self-Healing

Pruning is the process where Argo CD identifies resources that exist in the cluster (and are managed by the app) but are no longer present in the Git repository. If pruning is disabled, deleting a file in Git does nothing to the cluster.

Self-healing goes a step further. If a developer uses kubectl edit to change a replica count or an environment variable directly in the cluster, Argo CD detects the drift and immediately overwrites those changes with the state defined in Git.

Configuration

To enable these, modify the syncPolicy section of your Application manifest.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: 'https://github.com/argoproj/argocd-example-apps.git'
    targetRevision: HEAD
    path: guestbook
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: guestbook
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Apply this configuration using kubectl:

kubectl apply -f application.yaml

Testing the Policy

To verify pruning, delete a resource from your Git repository (for example, a Service manifest) and push the change.

git rm manifests/service.yaml
git commit -m "Remove legacy service"
git push origin main

Wait for the next sync cycle (usually 3 minutes by default, or instantly if you have a webhook configured), then check your cluster:

kubectl get svc -n guestbook
# Expected: the removed Service is no longer listed
# (kubectl prints "No resources found in guestbook namespace." if it was the only one)

Now, test self-healing. Try to manually scale your deployment:

kubectl scale deployment guestbook-ui --replicas=10 -n guestbook

Run kubectl get pods -n guestbook. You'll notice the pods scale up for a moment, but Argo CD detects the drift and scales them back down to the number specified in Git, typically within seconds.

Step 2: Mastering Advanced Sync Options

Standard syncing works for most resources, but Kubernetes has immutable fields. For example, if you try to change the spec.selector of a Deployment or the pod template of a Job, the Kubernetes API rejects the update with a 422 Unprocessable Entity error. Argo CD will then fail the sync on every attempt until the resource is replaced.

Using Replace=true

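The Replace=true option tells Argo CD to use kubectl replace or kubectl create instead of kubectl apply, so the whole object is overwritten rather than patched. A plain replace can still be rejected for immutable fields; to delete and recreate the resource, the Argo CD documentation combines it with Force=true (Force=true,Replace=true).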

Add this to your syncOptions:

spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - Replace=true
      - SkipDryRunOnMissingResource=true

SkipDryRunOnMissingResource=true is particularly useful when an application ships custom resources whose CRD is not yet registered in the cluster. In that case the dry-run validation fails because the API server does not know the resource type yet, even though the actual sync would succeed once the CRD has been applied.
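
Setting Replace=true under syncOptions applies it to every resource in the application. A narrower sketch, assuming only a single Job needs it (the db-migration Job from the sync wave example is reused here purely for illustration), puts the option in the argocd.argoproj.io/sync-options annotation on that one manifest. The Force=true,Replace=true combination is the one the Argo CD documentation describes for delete-and-recreate behaviour; verify it against your Argo CD version before relying on it.

```yaml
# Per-resource sync options: only this Job is force-replaced.
# Every other resource in the application keeps the default apply behaviour.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    argocd.argoproj.io/sync-options: Force=true,Replace=true
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: migration-tool:v1.2.0  # placeholder image, as in the wave example
      restartPolicy: OnFailure
```

Scoping the option this way keeps the data-loss risk of a full replace confined to resources that genuinely have immutable fields.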

ApplicationSet Level Policies

If you manage 50 clusters using an ApplicationSet, you don't want to define these policies 50 times. Define the syncPolicy within the template section of the ApplicationSet.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-config
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - cluster: engineering-dev
            url: https://kubernetes.default.svc
          - cluster: engineering-prod
            url: https://prod-cluster.example.com
  template:
    metadata:
      name: '{{cluster}}-guestbook'
    spec:
      project: default
      source:
        repoURL: 'https://github.com/argoproj/argocd-example-apps.git'
        targetRevision: HEAD
        path: guestbook
      destination:
        server: '{{url}}'
        namespace: guestbook
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
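
The syncPolicy inside template is copied into every generated Application. The ApplicationSet also has its own top-level spec.syncPolicy, which governs the generated Applications themselves. As a hedged sketch, the fragment below sets preserveResourcesOnDeletion so that deleting the ApplicationSet does not cascade into deleting the deployed workloads. Whether you want that safeguard is an assumption about your environment, not something the example above requires.

```yaml
# Fragment to merge into the ApplicationSet spec (generators and template omitted).
spec:
  # ApplicationSet-level policy, distinct from template.spec.syncPolicy
  syncPolicy:
    preserveResourcesOnDeletion: true
```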

Step 3: Implementing Sync Waves and Hooks

In production, you cannot deploy everything simultaneously. You might need a database schema migration to finish before the API server starts, or a smoke test to pass before the LoadBalancer switches traffic.

Sync Waves

Sync waves allow you to assign an order to resources. Argo CD applies resources in increasing order of their wave number. Resources with the same wave are applied concurrently.

Add the annotation argocd.argoproj.io/sync-wave to your manifests.

Database Migration (Wave 1):

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: migration-tool:v1.2.0
      restartPolicy: OnFailure

Application Deployment (Wave 2):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  # Deployment spec here

Cache Warmup (Wave 3):

apiVersion: batch/v1
kind: Job
metadata:
  name: cache-warmup
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  # Job spec here

Argo CD waits for the Wave 1 Job to reach a "Healthy" state before attempting to create the Wave 2 Deployment.
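
Resources without the annotation are treated as wave 0, and wave numbers may be negative. A minimal sketch, assuming the migration Job needs a ConfigMap in place before it runs (the name app-settings and its contents are hypothetical):

```yaml
# Applied before wave 0 and wave 1 because -1 sorts first.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings            # hypothetical name
  annotations:
    # Annotation values must be strings, so the number is quoted.
    argocd.argoproj.io/sync-wave: "-1"
data:
  DATABASE_HOST: db.guestbook.svc.cluster.local  # hypothetical value
```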

Sync Hooks

Hooks are used for transient tasks rather than permanent resources.

PreSync: Runs before the sync starts. Ideal for backups.
Sync: Runs during the sync.
PostSync: Runs after the sync completes. Ideal for notifications or integration tests.

Example of a PreSync backup hook:

apiVersion: batch/v1
kind: Job
metadata:
  name: pre-sync-backup
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: backup
          image: backup-util:latest
      restartPolicy: OnFailure

The HookSucceeded policy ensures the Job object is deleted from the cluster once it completes successfully, preventing the buildup of thousands of finished Job objects.
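
For comparison, a PostSync hook might look like the sketch below. It assumes the API is reachable through a Service named api-server that exposes a /healthz endpoint; neither is defined elsewhere in this tutorial, so treat both as placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: post-sync-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    # Delete the previous run only when the next one is created,
    # so the last result stays available for inspection.
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  backoffLimit: 1
  template:
    spec:
      containers:
        - name: smoke-test
          image: busybox:1.36
          # Hypothetical Service name and path; exits non-zero on failure.
          command: ["wget", "-q", "--spider", "http://api-server.guestbook.svc.cluster.local/healthz"]
      restartPolicy: Never
```

BeforeHookCreation is used here instead of HookSucceeded so that a finished Job remains visible until the next sync, which can help when a smoke test needs debugging.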

Step 4: Designing Robust Rollback Strategies

When a production deployment fails, the pressure to recover "right now" often leads to a conflict between "Pure GitOps" and "Fast Recovery."

Strategy A: The Git-based Rollback (Pure GitOps)

In this approach, you never use the Argo CD UI for rollbacks. You use git revert.

Pros:

Perfect audit trail.
Zero drift between Git and Cluster.
Works across multiple clusters simultaneously.

Cons:

Slower recovery time. You must commit, push and wait for the sync cycle.

Execution:

git log --oneline
# a1b2c3d (HEAD) Update image to v2.1.0 (BROKEN)
# e5f6g7h Update image to v2.0.0 (STABLE)

git revert a1b2c3d
git push origin main

Strategy B: The Argo CD UI/CLI Rollback (Emergency Fast-Track)

Argo CD allows you to rollback to a previous successful revision of the application. This is an immediate operation that bypasses Git.

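Execution using CLI (the number is a history ID taken from argocd app history):

argocd app history guestbook-production
argocd app rollback guestbook-production 12

The Danger Zone: Argo CD refuses to roll back an application while automated sync is enabled, so the command above fails until you disable auto-sync. Keep it disabled until Git is fixed. As soon as automated sync is re-enabled, Argo CD will see the cluster running v2.0.0 (due to the rollback) while Git still says v2.1.0, and it will "fix" the cluster by re-deploying the broken v2.1.0.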

The Decision Matrix

Follow these rules for professional rollback management:

For Non-Critical Bugs: Use git revert. It is the only way to ensure the environment remains reproducible.
For Critical Outages (P0):

Step 1: Disable Auto-Sync in the UI or CLI.
Step 2: Perform the Argo CD Rollback to a known good revision.
Step 3: Fix the code in Git.
Step 4: Update Git to the fixed version.
Step 5: Re-enable Auto-Sync.
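
If the Application manifest itself is managed from Git (for example by a parent "app of apps", which is an assumption about your setup), a UI or CLI change to the sync policy may be reverted by that parent. In that case Step 1 can be sketched declaratively by removing the automated block from the manifest for the duration of the incident:

```yaml
# Fragment of the guestbook-production Application during an incident.
spec:
  syncPolicy:
    # "automated" block removed: syncs now happen only when triggered manually.
    syncOptions:
      - CreateNamespace=true
```

Restoring the automated block in Step 5 then goes through the same review process as any other change.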

If you encounter constant Pod failures during these transitions, you might be facing a /troubleshooting/crashloopbackoff-kubernetes scenario, which requires log analysis before deciding on a rollback strategy.

Step 5: Implementing Custom Health Checks

Argo CD knows how to check the health of standard resources. However, if you use Custom Resource Definitions (CRDs) from an operator (like Prometheus or Istio), Argo CD only knows if the resource was created. It doesn't know if the operator actually succeeded in deploying the underlying components.

This means a Sync Wave might move to Wave 2 even if the Wave 1 CRD is still in a "Pending" or "Error" state.

Defining a Lua Health Check

Argo CD allows you to define health checks using Lua scripts in the argocd-cm ConfigMap in the argocd namespace.

Assume you have a custom resource called DatabaseInstance that has a status.phase field.

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.apps.example.com_DatabaseInstance: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.phase == 'Ready' then
        hs.status = 'Healthy'
        hs.message = 'Database is ready'
        return hs
      end
      if obj.status.phase == 'Failed' then
        hs.status = 'Degraded'
        hs.message = 'Database failed to provision'
        return hs
      end
    end
    hs.status = 'Progressing'
    hs.message = 'Waiting for database to be ready'
    return hs

Apply the change and restart the argocd-application-controller (a StatefulSet in the standard installation manifests):

kubectl apply -f argocd-cm.yaml
kubectl rollout restart statefulset argocd-application-controller -n argocd

Now, Argo CD will wait for the DatabaseInstance to reach the Ready phase before marking the resource as Healthy.
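
To make the script's input concrete, here is a sketch of what such a resource might look like in the cluster. The API version, the spec fields and the resource name are all hypothetical; the only thing the Lua script relies on is status.phase, which the operator writes and which never appears in Git.

```yaml
apiVersion: apps.example.com/v1   # hypothetical version
kind: DatabaseInstance
metadata:
  name: orders-db                 # hypothetical name
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  engine: postgres                # hypothetical field
status:
  # Set by the operator. The health check maps:
  #   Ready -> Healthy, Failed -> Degraded, anything else -> Progressing
  phase: Ready
```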

Step 6: Managing Sync Windows and Maintenance Periods

In enterprise environments, automated deployments are often prohibited during "Freeze Periods" (e.g., Black Friday). You still want GitOps to track changes, but you don't want them applied to the cluster.

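Argo CD has built-in sync windows, configured on the AppProject (spec.syncWindows), which allow or deny syncs on a cron schedule. If you cannot change the project, or want the freeze to be driven by external automation, you can also implement it using labels and a script.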

The Label-Based Freeze Approach

Add a label sync-window: frozen to your application.

kubectl label app guestbook-production sync-window=frozen -n argocd

Create a simple automation (via GitHub Action or CronJob) that toggles the automated sync policy.

The "Freeze" Script:

# Disable auto-sync during freeze
argocd app set guestbook-production --sync-policy manual

The "Unfreeze" Script:

# Enable auto-sync after freeze
argocd app set guestbook-production --sync-policy automated

For a more sophisticated approach, use an external controller that watches for these labels and modifies the Application spec. This ensures the cluster remains untouched until the window opens.
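
Argo CD's AppProject resource also supports sync windows natively through spec.syncWindows. The fragment below is a sketch of a deny window; the dates, the duration and the application name pattern are illustrative assumptions, and the project's other fields (sourceRepos, destinations and so on) are omitted and must be kept when you merge this into a real project.

```yaml
# Fragment of an AppProject: merge into your existing project definition.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  syncWindows:
    - kind: deny                 # block syncs while the window is active
      schedule: "0 0 28 11 *"    # cron: starts 00:00 on 28 November (illustrative)
      duration: 96h              # window length (illustrative)
      applications:
        - "*-production"         # glob pattern, matches guestbook-production
      manualSync: true           # still allow a deliberate manual sync
```

With a window like this, Argo CD keeps reporting the application as out of sync during the freeze but does not apply changes automatically.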

Troubleshooting

Issue 1: Resource "Flickering" (The Sync Loop)

A resource constantly switches between "Synced" and "Out of Sync." This typically happens when a controller (like an HPA or Service Mesh) modifies the resource after Argo CD applies it.

The Fix: Use ignoreDifferences to tell Argo CD to ignore fields managed by other controllers.

spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
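
Note that ignoreDifferences only affects how the diff is computed. When a sync does run, the full manifest from Git is still applied, which can reset the ignored field. A hedged sketch of combining it with the RespectIgnoreDifferences sync option, so that ignored fields are also left alone during the sync itself:

```yaml
spec:
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas
  syncPolicy:
    syncOptions:
      # Ask Argo CD to keep the live value of ignored fields when syncing.
      - RespectIgnoreDifferences=true
```

For the HPA case, removing spec.replicas from the Deployment manifest in Git is a common complementary step.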

Issue 2: Pruning Deleted Critical Resources

You accidentally deleted a namespace or a critical Secret in Git, and Argo CD pruned it from the cluster, causing an outage.

The Fix: Use the prune safety override. Annotate specific resources to prevent them from being pruned, regardless of the application-level policy.

metadata:
  annotations:
    argocd.argoproj.io/sync-options: Prune=false

Issue 3: Sync Wave Hanging

A sync wave is stuck in "Progressing" and refuses to move to the next wave.

The Fix: Check the health of the resource in the current wave. If it's a Job, ensure it is actually completing. If you implemented a custom health check, ensure the Lua script isn't returning Progressing indefinitely due to a typo in the status field.

kubectl describe job db-migration -n guestbook

FAQ

Q: Does prune: true delete resources in namespaces not managed by Argo CD?
A: No. Argo CD only prunes resources that are tracked within the specific Application's scope and managed by that application.

Q: Can I have different sync waves for different clusters?
A: Yes. Since sync waves are defined as annotations on the manifests themselves, you can use Kustomize or Helm to apply different annotations based on the target environment (e.g., a longer warmup wave in production than in dev).
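
As a rough sketch of the Kustomize variant, a production overlay could patch the annotation on the cache-warmup Job from Step 3. The directory layout (../../base) and the wave value are assumptions for illustration.

```yaml
# overlays/production/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Job
      name: cache-warmup
    patch: |-
      # "~1" is the JSON Pointer escape for "/" inside the annotation key.
      - op: add
        path: /metadata/annotations/argocd.argoproj.io~1sync-wave
        value: "5"
```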

Q: What happens if a Sync Hook fails?
A: If a PreSync hook fails, Argo CD stops the sync and marks the operation as "Failed," so the main manifests are never applied. This prevents the deployment of potentially broken code.

Q: Is Replace=true safe for all resources?
A: Not always. Since it overwrites the whole object instead of merging changes, any fields not defined in your Git manifest (like some dynamically assigned annotations or labels) will be lost. Use it only for resources with immutable fields.

Conclusion

Moving from manual synchronization to advanced sync policies separates a "demo" GitOps setup from a production-grade platform. By implementing automated pruning and self-healing, you eliminate configuration drift and ensure Git is the absolute source of truth. Sync waves and hooks bring the orchestration capabilities of traditional CI/CD pipelines into the declarative world of Kubernetes.

In this tutorial, we've covered:

Enabling prune and selfHeal to maintain cluster hygiene.
Using Replace=true to handle immutable Kubernetes fields.
Orchestrating complex deployments with Sync Waves and Hooks.
The critical distinction between Git reverts and Argo CD rollbacks.
Extending Argo CD's intelligence with custom Lua health checks.
Managing deployment freezes using sync windows.

Your next steps should be to audit your current Application manifests. Identify resources managed by other controllers and apply ignoreDifferences to stop sync flickering. Then, map out your application dependencies and assign sync waves to ensure your databases always precede your APIs.

This concludes our deep dive into Argo CD and our series on GitOps automation. For those looking to further their expertise in site reliability, our guide on /interview/senior-sre-interview-questions-answers-for-2026 provides insight into how these patterns are evaluated in professional settings.
