Nerav Doshi

Posted on Jun 15 • Edited on Jun 24 • Originally published at pipelineandprompts.com

Zero-Downtime Deployments on OpenShift with GitHub Actions and Feature Flags

#openshift #githubactions #zerodowntimedeployments #featureflags

Pipeline & Prompts | Byte size guides on DevOps, Cloud and AI

Byte size summary

After reading this article, you will know how to implement a blue/green deployment pipeline on OpenShift that uses HAProxy-backed Route weight splitting for traffic control and Flagsmith for feature flag management — and more importantly, you will know where the implementation breaks silently. Specifically: the HAProxy propagation gap that lets your smoke tests lie to you, the partial rollout state that puts two versions in production simultaneously, and why the standard approach of patching a Route weight and immediately proceeding has cost teams I've worked with entire migrations. The implementation uses GitHub Actions for orchestration, oc commands for OpenShift-specific traffic control, and Flagsmith as the feature flag service. The patterns apply to AKS, EKS, and GKE with platform-specific variations called out.

The story

In 2019 I was working on an EDI integration for a logistics client. The system moved shipment confirmations between a warehouse management platform and a carrier's TMS. It was not glamorous infrastructure, but it was load-bearing in the way that only becomes obvious when it stops working.

It stopped working on a Tuesday afternoon. No alarm fired. No dashboard went red. The integration just quietly stopped processing records. Operations managers figured it out around 6pm when the spreadsheets they maintained as a parallel source of truth diverged far enough to be noticed. By then the warehouse had been running off manual coordination for four hours, warehouse associates were staying late to reconcile records by hand, and someone had already called a carrier to explain why shipments confirmed that morning hadn't moved.

In automotive supply chains a failed integration can idle a production line. The cost isn't abstract — it's labor, overtime, contractual penalties, and a certain kind of trust that takes months to rebuild. That experience has shaped how I think about deployment risk ever since. Downtime has a zip code and a loading dock.

My first OpenShift deployment in that same era was instructive in a different way. The cluster was managed, the application was straightforward, and everything worked in the developer environment. We migrated to containerised deployment and hit ImagePullBackOff in production because the service account didn't have pull rights from the internal registry. That was fixable in twenty minutes. What wasn't fixable was the east-west traffic blocked by a NetworkPolicy that nobody had documented and that didn't exist in the permissive dev namespace. The application couldn't reach its own database. We retreated to the legacy application. Not a rollback — an abandonment. We'd built no safe path back that didn't lose state.

The deployment strategy had failed before we'd written a line of GitHub Actions YAML.

Around that time I was in a meeting with a Field CTO who understood feature flags conceptually — had read the LaunchDarkly white papers, knew the theory. But nobody in the room had the tooling experience, and no proof of concept existed. The decision stalled. I learned something from that meeting: being ahead of the concept is not the same as having the implementation. This article is the synthesis of that learning arc. Not a single project success story — an honest account of what the correct implementation looks like and where it breaks.

The problem

Platform engineers and SREs on OpenShift clusters face a specific version of the zero-downtime deployment problem that generic Kubernetes tutorials don't address. The vanilla kubectl rollout story breaks down in at least three places.

HAProxy is not nginx. OpenShift's Ingress Operator uses HAProxy-backed routers. Traffic splitting between blue and green isn't a load balancer weight change or an Nginx upstream swap — it's controlled through the Route object's alternateBackends and weight parameters. The propagation behaviour is different, the timing is different, and the failure modes are different.

Deployment knowledge lives in people, not pipelines. On small teams with a mix of experience levels, the deployment process exists as a combination of a script nobody fully understands and the mental model of whoever wrote it. This is the real failure mode — not the technology. When the engineer who wrote the script isn't on shift, the handoff becomes the primary risk surface. I've been on teams where deployments took 15–16 hours because every stage required a human to validate and continue. Not as a safety mechanism — as a substitute for pipeline logic that never got written. The manual gate was a single point of failure with a person attached to it.

The rollback path is usually an afterthought. It gets tested once during setup, if at all. By the time you need it under pressure, you discover it requires manual steps that aren't documented, or it works but loses session state, or it reverts infrastructure that should have stayed updated. A deployment strategy without a practiced rollback path isn't zero-downtime — it's a slower way to take downtime.

Why existing approaches fall short

Kubernetes rolling deployments handle pod replacement gracefully but give you no traffic control during the transition. You can't send 10% of traffic to the new version to validate behaviour before full cutover. If the new version has a bad interaction with production data or a production-specific dependency, the rolling update may already have replaced half your pods before you know something is wrong.

Basic blue/green without validation is the pattern most tutorials implement: deploy green, patch the Route, call it done. The gap is that patching the Route and HAProxy propagating the change are not instantaneous or synchronous. In a multi-replica Ingress Operator setup, different HAProxy router pods can be serving different weights simultaneously during propagation. Smoke tests run immediately after oc patch route can pass against the old version, giving false confidence before green is actually receiving traffic.

Manual gates solve the confidence problem but at the cost of deployment velocity and on-call sanity. A pipeline that requires a human to confirm each stage at 2am is a pipeline that will eventually be skipped.

Feature flags without deployment integration leave you with two independent controls that don't know about each other. The deployment can succeed while the flag is still off, or the flag can be enabled before the deployment has stabilised. The coordination happens in Slack or in someone's head, which means it doesn't happen consistently.

The architecture

Diagram 1: Traffic control lives in the Route object. The HAProxy router is the single control plane for the split. The dashed red zone marks the propagation gap — the window between oc patch route and HAProxy actually applying the change across all router pods.

The key design decision this diagram makes visible: traffic control and feature control are separate concerns that the pipeline coordinates, not conflates. The Route controls which Deployment receives traffic and in what proportion. Flagsmith controls which features within the deployed code are active. The pipeline is the coordinator — it advances the Route weight only after the HAProxy propagation check passes, and it enables flags only after the smoke tests pass against real traffic, not against the pod health endpoint.

The blast radius is bounded by the Route weight at all times. The pipeline can return all traffic to blue with a single Route patch — faster than a rollout, and it doesn't destroy the green Deployment or lose its configuration.

OpenShift-specific notes:

Traffic splitting uses route.spec.alternateBackends — this is an OpenShift Route extension, not standard Kubernetes Ingress
The Ingress Operator runs HAProxy router pods; the number of replicas affects propagation timing
Service accounts for the pipeline require patch on routes in the application namespace and get/list on pods and replicasets for validation

Implementation

Prerequisites

OpenShift 4.12 or later (HAProxy-based Ingress Operator; alternateBackends available since 4.x)
oc CLI matching cluster version. Use oc rather than kubectl for Route operations: Route is an OpenShift API (route.openshift.io), and every command in this article uses oc
GitHub Actions runner with network access to the OpenShift API endpoint
A service account token stored as a GitHub Actions secret (OC_TOKEN, OC_SERVER)
Flagsmith account or self-hosted Flagsmith instance; Flagsmith server-side environment key stored as FLAGSMITH_ENV_KEY and Admin API token stored as FLAGSMITH_ADMIN_TOKEN
Two Kubernetes Services already deployed: myapp-blue and myapp-green in the target namespace
A Route named myapp already configured with myapp-blue as the primary backend

The pipeline assumes myapp-blue is the current production version and myapp-green is the slot being deployed to.

Step 1 — Create the OpenShift service account for GitHub Actions

# Create a dedicated service account — do not reuse cluster-admin or developer accounts
oc create serviceaccount github-actions-deploy -n myapp-production

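# Bind the minimum required permissions.
# watch is needed by `oc rollout status`; deployments/scale is needed by `oc scale`.
oc create role github-actions-deploy-role \
  --verb=get,list,watch,patch,update \
  --resource=routes,deployments,deployments/scale,replicasets,pods \
  -n myapp-production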

oc create rolebinding github-actions-deploy-binding \
  --role=github-actions-deploy-role \
  --serviceaccount=myapp-production:github-actions-deploy \
  -n myapp-production

# Generate a long-lived token
# Note: on OpenShift 4.12+, token duration is capped by the cluster's
# --service-account-max-token-expiration policy. The command below will
# silently cap the duration if 8760h exceeds your cluster's limit.
# Verify the cap with:
#   oc get configmap config -n openshift-apiserver -o yaml \
#     | grep serviceAccountMaxTokenExpiration
oc create token github-actions-deploy \
  --duration=8760h \
  -n myapp-production
# Store the output as the OC_TOKEN GitHub secret

Rollback consideration: this service account can be deleted and recreated. Removing it does not affect running workloads — it only breaks the pipeline until recreated.
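
Because the API server may cap the requested duration, it can help to read the expiry back from the issued token rather than trust the flag. The following is a minimal sketch, assuming the token is a standard JWT whose payload carries a numeric exp claim with no whitespace around it. The helper name jwt_exp is illustrative and not part of oc.

```shell
#!/usr/bin/env bash
# Print the exp claim (Unix seconds) from a JWT's payload segment.
jwt_exp() {
  local payload
  # Take the second dot-separated segment and convert base64url to base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops padding; restore it so base64 -d accepts the input.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d 2>/dev/null \
    | grep -o '"exp":[0-9]*' | cut -d: -f2
}

# Usage (requires a cluster login and GNU date):
#   TOKEN=$(oc create token github-actions-deploy --duration=8760h -n myapp-production)
#   date -u -d "@$(jwt_exp "$TOKEN")"
```

If the printed date is much earlier than a year out, the cluster capped the token and the secret will need rotating sooner than planned.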

Step 2 — Configure the Route for blue/green traffic splitting

# Verify current Route state before touching it
oc get route myapp -n myapp-production -o yaml

# Patch the Route to add green as an alternate backend at 0% weight
# This sets up the split structure without shifting any traffic yet
oc patch route myapp -n myapp-production \
  --type=json \
  -p '[
    {
      "op": "add",
      "path": "/spec/alternateBackends",
      "value": [
        {
          "kind": "Service",
          "name": "myapp-green",
          "weight": 0
        }
      ]
    },
    {
      "op": "replace",
      "path": "/spec/to/weight",
      "value": 100
    }
  ]'

# Verify the patch applied correctly
oc get route myapp -n myapp-production \
  -o jsonpath='{.spec.to.weight} {.spec.alternateBackends[0].weight}'
# Expected output: 100 0

Weight arithmetic note: each weight is an integer from 0 to 256, and OpenShift normalises weights relative to each other, so 90+10 and 9+1 produce the same 90/10 traffic split. Weights must not both be 0: with no weighted backend the router has nowhere to send requests, and the documented result is a 503 response rather than a fallback to default behaviour. The values shown in this article (90/10, 0/100, 100/0) are explicit and unambiguous.
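
The normalisation can be checked with plain integer arithmetic before a weight pair goes anywhere near a Route. This is a small sketch; effective_split is a hypothetical helper name, and it models only the service-level split, not how HAProxy spreads a service's share across its endpoints.

```shell
#!/usr/bin/env bash
# Print the approximate percentage each backend receives for a pair of weights.
effective_split() {
  local primary=$1 alternate=$2
  local total=$(( primary + alternate ))
  if [ "$total" -eq 0 ]; then
    echo "both weights are 0: no backend would receive traffic" >&2
    return 1
  fi
  # Integer division truncates, so weights 1 and 2 print "33 66".
  echo "$(( primary * 100 / total )) $(( alternate * 100 / total ))"
}

effective_split 90 10   # prints: 90 10
effective_split 9 1     # prints: 90 10
```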

Rollback consideration: to remove green from the Route entirely, delete the alternateBackends field and set the primary weight back to 100. This is non-destructive to the green Deployment.
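
A sketch of that removal as a JSON Patch, using the Route and namespace names from this article. The function names are illustrative. Note that a JSON Patch "remove" operation fails if the target path does not exist, so this only applies once alternateBackends has been added.

```shell
#!/usr/bin/env bash
# JSON Patch that drops alternateBackends and restores the primary weight.
remove_green_patch() {
  cat <<'EOF'
[
  {"op": "remove", "path": "/spec/alternateBackends"},
  {"op": "replace", "path": "/spec/to/weight", "value": 100}
]
EOF
}

# Apply the patch to the Route (requires a cluster login).
remove_green_backend() {
  oc patch route myapp -n myapp-production \
    --type=json -p "$(remove_green_patch)"
}
```

Calling remove_green_backend leaves the myapp-green Deployment and Service untouched; only the Route stops referencing them.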

Step 3 — GitHub Actions workflow: RBAC preflight, deploy, validate, shift traffic

Diagram 2: The full pipeline. The RBAC preflight runs first — before any deployment work. The HAProxy validation loop (step 6) is what most pipelines skip. The promote/rollback fork at the bottom is the Flagsmith gate.

# [AUTHOR TO VALIDATE] — review all oc commands against your cluster version
# before using in production
name: Zero-Downtime Deploy to OpenShift

on:
  push:
    branches: [main]

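env:
  NAMESPACE: myapp-production
  ROUTE_NAME: myapp
  GREEN_SERVICE: myapp-green
  BLUE_SERVICE: myapp-blue
  # Placeholder: image repository without a tag; the deploy step reads env.IMAGE.
  # [AUTHOR TO VALIDATE] replace with your registry path
  IMAGE: registry.example.com/myapp/myapp
  HAPROXY_PROPAGATION_WAIT: 15  # seconds; tune for your Ingress Operator replica count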

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install oc CLI
        run: |
          # [AUTHOR TO VALIDATE] — pin to your cluster's minor version
          curl -sL https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz \
            | tar xz -C /usr/local/bin oc
          oc version --client

      - name: Log in to OpenShift
        run: |
          oc login ${{ secrets.OC_SERVER }} \
            --token=${{ secrets.OC_TOKEN }} \
            --insecure-skip-tls-verify=false

      # RBAC preflight runs first — before any deployment work.
      # If the service account can't patch Routes, fail here rather than
      # after green is half-deployed and the Route is in an inconsistent state.
      - name: RBAC preflight check
        run: |
          # The job is already logged in as the pipeline service account, so
          # ask about its own permissions. Passing --as here would require
          # impersonation rights, which this service account should not have.
          # oc auth can-i exits non-zero on "no", which fails the step.
          oc auth can-i patch routes -n ${{ env.NAMESPACE }}

          oc auth can-i patch deployments -n ${{ env.NAMESPACE }}

      - name: Deploy to green slot
        run: |
          # [AUTHOR TO VALIDATE] — replace with your actual image update command
          oc set image deployment/myapp-green \
            myapp-green=${{ env.IMAGE }}:${{ github.sha }} \
            -n ${{ env.NAMESPACE }}

          # Wait for rollout — do not proceed until green is healthy
          oc rollout status deployment/myapp-green \
            -n ${{ env.NAMESPACE }} \
            --timeout=5m

      - name: Shift 10% traffic to green
        run: |
          oc patch route ${{ env.ROUTE_NAME }} -n ${{ env.NAMESPACE }} \
            --type=json \
            -p '[
              {"op": "replace", "path": "/spec/to/weight", "value": 90},
              {"op": "replace", "path": "/spec/alternateBackends/0/weight", "value": 10}
            ]'

      # HAProxy propagation wait — this is not optional.
      # The Route object accepting the patch does not mean all HAProxy router
      # pods have applied the change. Without this loop, smoke tests run against
      # stale HAProxy state and can pass against the old version.
      - name: Wait for HAProxy propagation
        run: |
          wait_for_haproxy_propagation() {
            local expected_weight=$1
            local max_attempts=12
            local attempt=0

            while [ $attempt -lt $max_attempts ]; do
              current=$(oc get route ${{ env.ROUTE_NAME }} \
                -n ${{ env.NAMESPACE }} \
                -o jsonpath='{.spec.alternateBackends[0].weight}')

              if [ "$current" == "$expected_weight" ]; then
                echo "Route weight confirmed: $current"
                return 0
              fi

              echo "Attempt $((attempt+1))/$max_attempts — current weight: $current, waiting..."
              sleep 5
              attempt=$((attempt+1))
            done

            echo "HAProxy propagation check timed out"
            return 1
          }

          wait_for_haproxy_propagation 10

          # Note: the Route object reflecting the correct weight does not guarantee
          # all HAProxy router pods have applied the configuration. This is a
          # necessary but not sufficient check. The smoke test against the Route
          # hostname provides the actual validation signal.

      - name: Smoke test against live traffic
        run: |
          # Test against the Route hostname, not the Service or pod IP.
          # Testing against the Service bypasses HAProxy entirely and will always
          # show the new version regardless of Route weight state.
          ROUTE_HOST=$(oc get route ${{ env.ROUTE_NAME }} \
            -n ${{ env.NAMESPACE }} \
            -o jsonpath='{.spec.host}')

          curl -sf --retry 5 --retry-delay 3 \
            https://$ROUTE_HOST/health || {
            echo "Smoke test failed — rolling back to blue"
            oc patch route ${{ env.ROUTE_NAME }} -n ${{ env.NAMESPACE }} \
              --type=json \
              -p '[
                {"op": "replace", "path": "/spec/to/weight", "value": 100},
                {"op": "replace", "path": "/spec/alternateBackends/0/weight", "value": 0}
              ]'
            exit 1
          }

      - name: Shift 100% traffic to green
        run: |
          oc patch route ${{ env.ROUTE_NAME }} -n ${{ env.NAMESPACE }} \
            --type=json \
            -p '[
              {"op": "replace", "path": "/spec/to/weight", "value": 0},
              {"op": "replace", "path": "/spec/alternateBackends/0/weight", "value": 100}
            ]'

          # Wait for full propagation before enabling the flag
          wait_for_haproxy_propagation() {
            local expected_weight=$1
            local max_attempts=12
            local attempt=0
            while [ $attempt -lt $max_attempts ]; do
              current=$(oc get route ${{ env.ROUTE_NAME }} \
                -n ${{ env.NAMESPACE }} \
                -o jsonpath='{.spec.alternateBackends[0].weight}')
              if [ "$current" == "$expected_weight" ]; then
                echo "Full propagation confirmed"
                return 0
              fi
              sleep 5
              attempt=$((attempt+1))
            done
            echo "Full propagation timed out"
            return 1
          }
          wait_for_haproxy_propagation 100

      - name: Enable feature flag in Flagsmith
        run: |
          # Uses Flagsmith's experimental Admin API update endpoint.
          # Authentication requires a server-side Admin API token (not the public
          # Environment Key) — use an environment-scoped token, never an account key.
          # Returns 204 No Content on success.
          # [AUTHOR TO VALIDATE] — confirm environment_key matches your production
          # Flagsmith environment and that change requests are not enabled
          # (this endpoint is incompatible with change request workflows).
          curl -sf -X POST \
            "https://api.flagsmith.com/api/experiments/environments/${{ secrets.FLAGSMITH_ENV_KEY }}/update-flag-v1/" \
            -H "Authorization: Api-Key ${{ secrets.FLAGSMITH_ADMIN_TOKEN }}" \
            -H "Content-Type: application/json" \
            -d '{
              "feature": {"name": "new_checkout_flow"},
              "enabled": true,
              "value": {"type": "boolean", "value": "true"}
            }'

      - name: Mark blue as standby
        run: |
          # Scale down blue but do not delete it — it is the rollback target.
          # Keeping one replica running means rollback is a Route patch,
          # not a scale-up-then-patch sequence under pressure.
          oc scale deployment/myapp-blue --replicas=1 \
            -n ${{ env.NAMESPACE }}
          echo "Blue deployment scaled to 1 replica (standby)"

Step 4 — Understanding the HAProxy propagation gap

The wait_for_haproxy_propagation function in Step 3 polls the Route object. This is necessary but not sufficient. There is a meaningful gap between the Route object reflecting the correct weight and all HAProxy router pods actually applying that configuration — the size of this gap is real, environment-dependent, and undocumented. In a cluster where the Ingress Operator runs multiple HAProxy router replicas, propagation is per-replica: different router pods can serve different weights simultaneously during the window.

This is why the smoke test runs against the Route hostname rather than the Service directly. The Service bypasses HAProxy entirely. Only a test through the Route hostname catches the propagation state you actually care about.
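
One consequence worth making concrete: at a 90/10 split, a single request through the Route most likely lands on blue, so one passing health check says little about green. The sketch below samples the Route several times and reports what share of responses came from the expected version. It assumes the application exposes a hypothetical /version endpoint that returns a plain version string such as the Git SHA; that endpoint and both function names are not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Fetch the URL n times and print one response body (or "error") per line.
sample_route() {
  local url=$1 n=$2 i body
  for i in $(seq 1 "$n"); do
    body=$(curl -sf --max-time 5 "$url") || body="error"
    printf '%s\n' "$body"
  done
}

# Read one response per line on stdin; print the percentage matching $1.
version_share() {
  local expected=$1 total=0 hits=0 line
  while IFS= read -r line; do
    total=$(( total + 1 ))
    if [ "$line" = "$expected" ]; then
      hits=$(( hits + 1 ))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo 0
    return 1
  fi
  echo $(( hits * 100 / total ))
}

# Usage (ROUTE_HOST and GIT_SHA are supplied by the pipeline):
#   share=$(sample_route "https://$ROUTE_HOST/version" 50 | version_share "$GIT_SHA")
#   [ "$share" -gt 0 ] || { echo "green never answered"; exit 1; }
```

Each curl invocation opens a fresh connection, so the sample is not pinned to one backend by connection reuse.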

Blast radius states

When the pipeline fails mid-deployment — after shifting traffic but before completing validation — the resulting state depends on exactly where the failure landed. These three states have different symptoms and different levels of operational risk.

Diagram 3: Three ways the propagation can fail. State 2 is the most dangerous because it is silent — both versions are live, bugs are intermittent, and correlation with the deployment is difficult.

State 1: HAProxy still on blue. The most common failure mode. The Route weight shows green in the config, but HAProxy hasn't propagated yet. Users still get blue. Smoke tests run directly against the Service and pass, because the Service never touches HAProxy. Any logic that infers the live slot from the Route weights is now inverted, so every subsequent deployment decision is made against incorrect state. Low immediate user impact, high operational confusion.

State 2: Partial propagation across router replicas. The most dangerous state. Router Pod A is serving blue, Router Pod B is serving green. Both versions are live in production simultaneously. Bugs in the new version affect some users but not others, with no obvious correlation to the deployment. Standard monitoring may not surface this at all, because aggregate error rates may not move if the new version's bugs are subtle. This state requires active diagnosis: group error rates by serving version (or by router pod) over a sample window and look for a bimodal distribution.

State 3: Full propagation, timed-out validation. The validation loop completed its maximum attempts before the Route weight was confirmed. HAProxy has fully propagated to green, so the deployment is actually correct. But the pipeline has triggered a rollback of a successful deployment, returning all traffic to blue and leaving green deployed but dark. The operational waste is real; the bigger risk is eroding pipeline trust. If this happens repeatedly, teams start skipping the validation loop to avoid false rollbacks, which removes the only protection against State 2.

Diagnosing which state you're in: check oc get route myapp -o yaml for the weight values first, then compare against what traffic is actually being served using the Route hostname. Discrepancy between config and observed traffic is State 1 or State 2.
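
That comparison can be reduced to a small decision function. This is a sketch of one possible interpretation, not a definitive classifier: it takes the configured green percentage (the article's weights sum to 100, so the weight is the percentage) and an observed share of green responses, and labels the result. The function name, the labels, and the default tolerance of 10 percentage points for sampling noise are all assumptions.

```shell
#!/usr/bin/env bash
# Compare the configured green percentage with the observed green share.
# tolerance (percentage points) absorbs sampling noise.
classify_state() {
  local configured=$1 observed=$2 tolerance=${3:-10}
  local diff=$(( observed - configured ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  if [ "$configured" -gt 0 ] && [ "$observed" -eq 0 ]; then
    echo "state-1"              # config says green, traffic entirely on blue
  elif [ "$diff" -gt "$tolerance" ]; then
    echo "state-2-suspected"    # green is serving, but not at the configured share
  else
    echo "consistent"
  fi
}

# Usage (requires a cluster login; observed comes from sampling the Route hostname):
#   configured=$(oc get route myapp -n myapp-production \
#     -o jsonpath='{.spec.alternateBackends[0].weight}')
#   classify_state "$configured" "$observed"
```

With small samples a low configured share can legitimately observe zero green responses, so a state-1 result at 10% is a prompt to sample again, not a verdict.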

Security considerations

Service account scope creep. The github-actions-deploy service account starts with a reasonable Role, but in practice teams expand it incrementally when deployments fail for permission reasons. After six months the service account often has broader permissions than the original design intended. Audit with oc auth can-i --list --as=system:serviceaccount:myapp-production:github-actions-deploy -n myapp-production on a schedule — not just at setup. The blast radius of a compromised pipeline token is the blast radius of whatever this service account can do.
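
A scheduled audit is easier to act on when it reports only what changed. A minimal sketch, assuming a reviewed baseline of the listing has been saved to a file; the function name and file names are illustrative.

```shell
#!/usr/bin/env bash
# Print permission lines present in the current listing ($2) but absent from
# the reviewed baseline ($1). Both files hold `oc auth can-i --list` output.
permission_drift() {
  # grep exits 1 when nothing differs; treat that as success, keep real errors.
  grep -Fxv -f "$1" "$2" || [ $? -eq 1 ]
}

# Usage (run as a user allowed to impersonate the service account):
#   oc auth can-i --list \
#     --as=system:serviceaccount:myapp-production:github-actions-deploy \
#     -n myapp-production > current.txt
#   permission_drift baseline.txt current.txt
```

Any output is a permission that was added since the last review and deserves a named owner and a reason.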

Feature flag API key exposure. The Flagsmith Admin API token in GitHub Actions secrets is a long-lived credential. If it leaks, an attacker can enable or disable features in production without touching the cluster. Use environment-level API tokens, not account-level tokens — Flagsmith supports environment-scoped keys specifically to limit this blast radius. Treat flag state changes as deployments: they have the same production impact.

HAProxy timeout partial state. If the pipeline fails mid-deployment — after shifting traffic to green but before the final validation — you can be left in State 2 (see Blast radius states above) indefinitely. The pipeline must have explicit rollback steps that fire on any failure after the first Route patch. A partially-propagated state is worse than a failed deployment.
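
One way to sketch such a rollback step in shell, reading ROUTE_NAME and NAMESPACE from the workflow environment. The function names are illustrative. In GitHub Actions, a final step guarded by the standard `if: failure()` condition can call rollback_to_blue so it fires whenever any earlier step fails.

```shell
#!/usr/bin/env bash
# Build a JSON Patch that sets the primary (blue) and alternate (green) weights.
weight_patch() {
  printf '[{"op":"replace","path":"/spec/to/weight","value":%d},' "$1"
  printf '{"op":"replace","path":"/spec/alternateBackends/0/weight","value":%d}]\n' "$2"
}

# Return all traffic to blue. Re-applying the same weights is harmless,
# so this is safe to run even if no traffic was shifted yet.
rollback_to_blue() {
  oc patch route "$ROUTE_NAME" -n "$NAMESPACE" \
    --type=json -p "$(weight_patch 100 0)"
}
```

If the failure happens after the Flagsmith step, returning traffic to blue does not revert the flag; whether the flag should also be disabled is a separate decision this sketch does not make.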

Security Context Constraint (SCC) requirements. If the application requires a non-default SCC (anything beyond restricted, or restricted-v2 on OpenShift 4.11 and later), that SCC must be bound to the application's service account before deployment, not to the pipeline's service account. The pipeline service account should not have use on privileged or anyuid. Validate SCC bindings as part of the prerequisite check, not after a FailedCreate event reading "unable to validate against any security context constraint" sends you to the logs at 11pm.

Tradeoffs

Fine-grained traffic control during deployment vs. Route complexity.
The alternateBackends structure gives you real percentage-based traffic splitting at the HAProxy layer. What you give up is simplicity: the Route object now has two backends, weight arithmetic must be managed explicitly (both cannot be zero; OpenShift normalises but edge cases are worth testing), and any tooling that reads or patches the Route needs to understand the alternate backend structure.

Deployment rollback via Route patch vs. keeping blue at full capacity.
Rolling back is fast — a Route patch and a propagation wait. But this only works while blue is still running and healthy. If you scale blue to zero after a successful green deployment, rollback requires a scale-up first, which adds latency under pressure. Keeping blue at one replica (standby) as shown above is the right call. It costs one pod's worth of memory.
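
Since the fast path depends on blue actually being ready, a rollback script can check that before patching the Route. A minimal sketch; standby_ready is a hypothetical helper, and it relies on the jsonpath query returning an empty string when the Deployment reports no ready replicas.

```shell
#!/usr/bin/env bash
# Succeed only if the reported ready-replica count is a positive integer.
standby_ready() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: nothing is ready
  esac
  [ "$1" -ge 1 ]
}

# Usage (requires a cluster login):
#   ready=$(oc get deployment myapp-blue -n myapp-production \
#     -o jsonpath='{.status.readyReplicas}')
#   standby_ready "$ready" || echo "blue has no ready replica: scale up before rolling back"
```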

Smoke tests against Route hostname vs. direct pod health checks.
Testing against the Route hostname gives you real traffic validation through HAProxy. It also means your smoke tests are affected by HAProxy propagation state — if you run them before the propagation loop completes, they pass against the old version. Testing against the pod IP or the Service directly is faster and more predictable, but it bypasses the traffic layer you're actually trying to validate. The HAProxy propagation wait exists because of this tradeoff, not despite it.

Feature flags as a deployment mechanism vs. as a product tool.
Flagsmith is not a deployment orchestrator. Treating it as one means your flag state becomes a deployment artifact that needs audit history, rollback procedures, and access controls that were designed for product managers, not SREs. The integration shown here is deliberately narrow: the pipeline enables one flag on successful deployment. It does not use flags to control rollout percentage — that's the Route's job. Keep these concerns separate or you end up debugging both simultaneously.

What I'd do differently

Add the HAProxy propagation validation loop on day one. Not after the first mysterious smoke test pass on a deployment that turned out to still be blue. The fixed sleep looks like it works until the cluster is under load or the Ingress Operator restarts a router pod mid-deployment. The polling loop is five more lines. Write it first.

Decouple the Flagsmith environment from production from the start. Environments in Flagsmith are cheap to create. Having a staging environment that mirrors production flag state but requires a manual promotion to production adds an explicit gate that pays for itself the first time someone enables a flag in the wrong environment.

Build RBAC preflight checks into the pipeline as a first step. The oc auth can-i check should run before any deployment work starts. If the service account can't patch Routes, you want to know before you've deployed the new image and left green in a half-deployed state. The pipeline in Step 3 above does this correctly — this is what it looks like to get the ordering right.

Treat flaky smoke tests as blocking, not acceptable noise. A smoke test that fails intermittently is not a test that needs a retry loop — it is a signal about application startup behaviour or health endpoint implementation that will eventually cause a false-negative rollback or a false-positive deployment. The first time a flaky test passes when it should have failed, you will have deployed a broken version with green lights on the pipeline.
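
One way to make flakiness visible rather than retried away is to require every attempt to pass. The sketch below is an illustration of that idea, not the pipeline's existing smoke test; run_strict is a hypothetical helper that runs any command a fixed number of times and fails on the first failure.

```shell
#!/usr/bin/env bash
# Run a check n times and fail on the first failure, reporting which attempt
# failed so intermittent behaviour shows up in the job log.
run_strict() {
  local n=$1 i
  shift
  for i in $(seq 1 "$n"); do
    if ! "$@"; then
      echo "attempt $i of $n failed: $*" >&2
      return 1
    fi
  done
}

# Usage:
#   run_strict 5 curl -sf --max-time 5 "https://$ROUTE_HOST/health"
```

A check that passes four times out of five now fails the step, which turns the intermittent failure into something that has to be investigated.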

Keep blue alive at one replica as a standing policy, not a deployment configuration. The temptation after a successful deployment is to scale blue to zero to reclaim resources. The first time you need to roll back quickly under pressure, you will wish you hadn't. One pod is a small standing cost against an emergency.

GitHub repo

agentic-devops/pipelineandprompts-labs

Working implementations of all pipeline steps, the HAProxy propagation validation function, and the RBAC setup commands are in the repo.

What's next

Next in Pipelines in the Wild: pipeline observability, meaning instrumenting GitHub Actions workflows for SRE-level visibility into deployment health. Specifically: surfacing HAProxy propagation timing as a metric, detecting State 2 partial propagation in alerting, and building a deployment health dashboard that actually reflects what HAProxy is doing rather than what the pipeline thinks it's doing.

