Vivian Chiamaka Okose

Why My CI Pipeline Kept Failing (And What I Had to Redesign to Fix It)

My CI pipeline kept rejecting pushes. Same error, every single run.

error: failed to push some refs to 'https://github.com/...'
Updates were rejected because the remote contains work that
you do not have locally.

The first time it happened I thought it was a simple git issue. Pull, rebase, push. Done.

It wasn't done. It happened again on the next run. And the one after that.

This is the story of what was actually broken, why a retry loop didn't fix it, and the architectural change that finally solved it.


What the Pipeline Was Doing

The CI pipeline builds 11 Docker images — one per microservice — and pushes them to Amazon ECR. After each build, it updates a values.yaml file in the repo with the new image tag, commits the change, and pushes it back to Git. ArgoCD watches that file and deploys automatically when it changes.

Here is a simplified version of what each job was doing:

- name: Build, tag, and push image to ECR
  run: |
    docker build -t "$IMAGE_URI" ./src/${{ matrix.service }}/
    docker push "$IMAGE_URI"

- name: Update image tag in Helm values
  run: |
    yq -i ".${{ matrix.service }}.image.tag = \"${IMAGE_TAG}\"" \
      helm-chart/values.yaml
    git add helm-chart/values.yaml
    git commit -m "ci: update ${{ matrix.service }} image to ${IMAGE_TAG}"
    git push

Looks reasonable. The problem is that the matrix strategy runs all 11 jobs in parallel, and every one of them pushes to the same branch.


The Actual Problem: A Race Condition

Here is what happens when 11 jobs all run at the same time:

  1. All 11 jobs check out the repo at the same commit
  2. All 11 build their image and push to ECR
  3. All 11 try to update values.yaml and push at roughly the same time
  4. The first job to push wins
  5. Every other job gets rejected because the remote has moved forward

! [rejected] vivian -> vivian (fetch first)
error: failed to push some refs
hint: Updates were rejected because the remote contains work
hint: that you do not have locally.

This is a classic race condition. Multiple writers, one file, no coordination.
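
The rejection is easy to reproduce without CI. The sketch below is a minimal local simulation, assuming only that git is installed: a bare repository stands in for GitHub, and two clones (the hypothetical job-a and job-b) stand in for two matrix jobs that checked out the same commit. The paths, tags, and file contents are illustrative and not taken from the real repo.

```shell
#!/usr/bin/env bash
# Local simulation of two CI jobs racing to push to the same branch.
# Everything here (paths, job names, file contents) is illustrative.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A bare repository plays the role of the GitHub remote.
git init --quiet --bare "$work/remote.git"

# Seed the remote with an initial values.yaml on branch "main".
git init --quiet "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/main
printf 'frontend:\n  image:\n    tag: old\ncartservice:\n  image:\n    tag: old\n' \
  > "$work/seed/values.yaml"
git -C "$work/seed" add values.yaml
git -C "$work/seed" commit --quiet -m "initial values"
git -C "$work/seed" push --quiet "$work/remote.git" main

# Both jobs check out the same commit.
git clone --quiet --branch main "$work/remote.git" "$work/job-a"
git clone --quiet --branch main "$work/remote.git" "$work/job-b"

# Each job rewrites the file with its own tag and commits.
printf 'frontend:\n  image:\n    tag: abc123\ncartservice:\n  image:\n    tag: old\n' \
  > "$work/job-a/values.yaml"
git -C "$work/job-a" commit --quiet -am "ci: update frontend"

printf 'frontend:\n  image:\n    tag: old\ncartservice:\n  image:\n    tag: abc123\n' \
  > "$work/job-b/values.yaml"
git -C "$work/job-b" commit --quiet -am "ci: update cartservice"

# The first push wins; the second is rejected because the remote moved.
git -C "$work/job-a" push --quiet origin main
first_push=$?
git -C "$work/job-b" push --quiet origin main 2>/dev/null
second_push=$?

echo "job-a push exit code: $first_push"
echo "job-b push exit code: $second_push"
```

The first push exits with 0 and the second with a non-zero status, which is the same rejection the pipeline hit, just with two writers instead of eleven.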

[Image: CI pipeline failing]


The First Fix Attempt: Retry with Rebase

The first thing I tried was adding a retry loop. Fetch the latest, rebase, try again. If it fails, wait a random number of seconds and retry up to 5 times.

- name: Update image tag in Helm values
  run: |
    for i in 1 2 3 4 5; do
      git fetch origin
      git rebase origin/${{ github.ref_name }}

      yq -i ".${{ matrix.service }}.image.tag = \"${IMAGE_TAG}\"" \
        helm-chart/values.yaml

      git add helm-chart/values.yaml
      git diff --staged --quiet && echo "No changes to commit" && exit 0
      git commit -m "ci: update ${{ matrix.service }} image to ${IMAGE_TAG}"

      git push && echo "Push succeeded" && exit 0

      echo "Push failed, attempt $i of 5. Retrying..."
      sleep $((RANDOM % 10 + 5))
    done

    echo "All push attempts failed"
    exit 1

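This helped but didn't fully solve it. On a retry, the rebase replays the job's own commit from the failed attempt on top of commits that other jobs have already pushed. Both sides had edited values.yaml, and Git could not always merge the two edits automatically, so the rebase stopped with a conflict.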

CONFLICT (content): Merge conflict in helm-chart/values.yaml
error: could not apply fad9a46...

A retry loop treats the symptom. The real problem is the design.
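
For comparison, the rebase conflict itself can be avoided by never replaying the old commit: on each attempt, reset to the remote tip and re-apply the edit from scratch. The sketch below tries this against a local bare repository. It is an illustration under assumptions, not what the pipeline ended up using: set_tag is a hypothetical awk stand-in for the yq edit, update_with_retry is an invented name, and the branch is assumed to be main. Even when it works, every job is still contending for the same branch, which is why the redesign in the next section is the better fix.

```shell
#!/usr/bin/env bash
# Sketch: retry by resetting to the remote tip and re-applying the edit,
# instead of rebasing a stale commit. All names here are illustrative.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical stand-in for: yq -i ".$service.image.tag = \"$tag\"" values.yaml
set_tag() {
  local service=$1 tag=$2 file=$3
  awk -v svc="$service" -v tag="$tag" '
    /^[^ ]/ { current = $0; sub(/:.*/, "", current) }
    current == svc && /^    tag:/ { print "    tag: " tag; next }
    { print }
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Fetch, reset, edit, commit, push; start over from the new tip on rejection.
update_with_retry() {
  local repo=$1 service=$2 tag=$3 attempt
  for attempt in 1 2 3 4 5; do
    git -C "$repo" fetch --quiet origin
    git -C "$repo" reset --quiet --hard origin/main
    set_tag "$service" "$tag" "$repo/values.yaml"
    git -C "$repo" diff --quiet && return 0   # tag already up to date
    git -C "$repo" commit --quiet -am "ci: update $service image to $tag"
    git -C "$repo" push --quiet origin main 2>/dev/null && return 0
    sleep 1
  done
  return 1
}

# Demo: a bare remote and two clones that start from the same commit.
git init --quiet --bare "$work/remote.git"
git init --quiet "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/main
printf 'frontend:\n  image:\n    tag: old\ncartservice:\n  image:\n    tag: old\n' \
  > "$work/seed/values.yaml"
git -C "$work/seed" add values.yaml
git -C "$work/seed" commit --quiet -m "initial values"
git -C "$work/seed" push --quiet "$work/remote.git" main
git clone --quiet --branch main "$work/remote.git" "$work/job-a"
git clone --quiet --branch main "$work/remote.git" "$work/job-b"

# job-b already holds a commit made from the old base (a failed first attempt).
set_tag cartservice abc123 "$work/job-b/values.yaml"
git -C "$work/job-b" commit --quiet -am "ci: update cartservice (stale)"

update_with_retry "$work/job-a" frontend abc123
# job-b discards its stale commit, resets to the new tip, and edits again.
update_with_retry "$work/job-b" cartservice abc123
result=$?

git clone --quiet --branch main "$work/remote.git" "$work/check"
cat "$work/check/values.yaml"
```

In this demo both tags end up as abc123 on the remote and no rebase ever runs, so there is nothing to conflict. The cost is that the job must be able to regenerate its edit from scratch, which holds here because the edit is a single tag assignment.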


The Real Fix: One Writer, Not Eleven

The root cause is that 11 jobs should never be writing to the same file at the same time. The solution is to separate the concerns completely.

Build jobs do one thing: build and push the image.

A single downstream job, running only after all 11 builds finish, updates values.yaml once with all the new tags in a single commit.

Here is the redesigned workflow:

jobs:
  build-and-push:
    name: Build ${{ matrix.service }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        service:
          - frontend
          - cartservice
          - productcatalogservice
          - currencyservice
          - paymentservice
          - shippingservice
          - emailservice
          - checkoutservice
          - recommendationservice
          - adservice
          - loadgenerator
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          fetch-depth: 0

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

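      - name: Build, tag, and push image to ECR
        env:
          IMAGE_TAG: ${{ github.sha }}
          # Assumption: one ECR repository per service, named after the service.
          IMAGE_URI: ${{ steps.login-ecr.outputs.registry }}/${{ matrix.service }}:${{ github.sha }}
        run: |
          docker build -t "$IMAGE_URI" ./src/${{ matrix.service }}/
          docker push "$IMAGE_URI"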

  update-helm-values:
    name: Update Helm values
    runs-on: ubuntu-latest
    needs: build-and-push
    if: github.ref == 'refs/heads/main' || github.ref == 'refs/heads/vivian'
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          fetch-depth: 0

      - name: Install yq
        run: |
          sudo wget -qO /usr/local/bin/yq \
            https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64
          sudo chmod +x /usr/local/bin/yq

      - name: Update all image tags in Helm values
        env:
          IMAGE_TAG: ${{ github.sha }}
        run: |
          services=(
            frontend cartservice productcatalogservice currencyservice
            paymentservice shippingservice emailservice checkoutservice
            recommendationservice adservice loadgenerator
          )
          for service in "${services[@]}"; do
            yq -i ".$service.image.tag = \"${IMAGE_TAG}\"" \
              helm-chart/values.yaml
            echo "Updated $service to ${IMAGE_TAG}"
          done

      - name: Commit and push updated values
        env:
          IMAGE_TAG: ${{ github.sha }}
        run: |
          git config user.name "GitHub Actions Bot"
          git config user.email "actions@github.com"
          git add helm-chart/values.yaml
          git diff --staged --quiet && echo "No changes to commit" && exit 0
          git commit -m "ci: update all images to ${IMAGE_TAG}"
          git push origin ${{ github.ref_name }}

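The key line is needs: build-and-push. It tells GitHub Actions to run the update job only after every build job in the matrix has completed successfully. Because fail-fast: false is set, one failing build does not cancel the others, but any failed build still causes the update job to be skipped, so values.yaml is never pointed at a tag whose image was not pushed.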


What Changed and Why It Works

Before: 11 jobs, each writing to values.yaml as soon as their build finished. No coordination. First push wins, rest fail.

After: 11 jobs build and push images. They do nothing else. One job runs after all of them finish, loops through all services, updates every tag in a single pass, commits once, pushes once.

One writer. No conflicts. Clean history.
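
With a single writer, it also becomes cheap to check the result before committing. The sketch below is one possible safeguard, not part of the workflow above: check_tags is a hypothetical helper that lists any service whose tag differs from the expected one. It uses awk instead of yq so that it runs anywhere, and it assumes the service.image.tag layout with two-space indentation implied by the yq expressions in the workflow.

```shell
#!/usr/bin/env bash
# Sketch: list services whose image tag in a values file is not the expected one.
# check_tags is a hypothetical helper; the file layout is an assumption.
check_tags() {
  local file=$1 expected=$2
  awk -v want="$expected" '
    /^[^ #]/ { current = $0; sub(/:.*/, "", current) }   # top-level key = service
    /^    tag:/ {
      got = $2; gsub(/"/, "", got)
      if (got != want) { print current; bad = 1 }
    }
    END { exit bad + 0 }
  ' "$file"
}

# Sample file with one service left on an old tag.
sample=$(mktemp)
cat > "$sample" <<'EOF'
frontend:
  image:
    tag: "abc123"
cartservice:
  image:
    tag: "abc123"
adservice:
  image:
    tag: "old999"
EOF

stale=$(check_tags "$sample" abc123)
status=$?
echo "stale services: ${stale:-none} (exit $status)"
```

For the sample file this reports adservice and exits with status 1. In a workflow step, that non-zero status would stop the job before the commit.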

[Image: CI pipeline passing after the fix]

The full workflow history tells the story clearly: red X runs on the left, where the old design kept failing, and green checkmarks once the fix went in.

[Image: Workflow history showing the debugging journey]


The Broader Lesson

Race conditions in CI pipelines are easy to miss because each individual step looks correct. The build step is correct. The push step is correct. The commit step is correct. But the system design is wrong.

When you have parallel jobs touching shared state, you need to ask: who owns this resource? In this case, values.yaml should have exactly one writer. Once I framed it that way, the fix was obvious.

If you are building a similar GitOps pipeline with GitHub Actions and Helm, separate your build jobs from your manifest update job from the start. It saves you a frustrating debugging session later.


What Comes Next

This pipeline feeds directly into ArgoCD, which watches values.yaml and syncs the cluster automatically when the file changes. In the next post I'll walk through setting up ArgoCD on EKS, connecting it to the repo, and the pod scheduling problem that showed up once everything was deployed.


This is part of an ongoing series documenting a full DevOps project built on Google's Online Boutique microservices demo, deployed to AWS EKS with Terraform, GitHub Actions, ArgoCD, Helm, Prometheus, and Grafana.

Repo: github.com/pawsible-cloud/online-boutique-platform
