Adedamola Ajibola

Posted on Jun 27

Engineering Debug Diaries #1: The ImagePullBackOff That Wasn't Kubernetes

#githubactions #devops #docker #kubernetes

How a missing GitHub Actions output caused an ImagePullBackOff, and what it taught me about building reliable CI/CD pipelines.

Engineering Debug Diaries is a series where I document real debugging sessions, production incidents, and the engineering lessons they taught me.

Every article focuses on the investigation, the root cause, and the practical changes I made afterwards.

The Incident

A deployment failed with:

ImagePullBackOff

My first guesses were the obvious ones.

Docker Hub is unavailable.
Registry credentials expired.
Kubernetes can't pull the image.

So I checked Docker Hub.

The image was there.

docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-52

Exactly where I expected it.

At that point I had two conflicting facts.

The image existed.
Kubernetes couldn't pull it.

One of my assumptions had to be wrong.

Time to follow the evidence.

The Investigation

Step 1 - Verify the Registry

First, I ruled out Docker Hub.

Image exists
Registry credentials are valid
Build pipeline completed successfully

So the registry wasn't the problem.

Step 2 - Inspect the Pod

Next, I inspected the pod events.

kubectl describe pod api-service-staging-xxx -n staging

The output immediately caught my attention.

Failed to pull image

docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-54

Wait...

Docker Hub contained:

17:52

But Kubernetes was requesting:

17:54

Same commit.

Different timestamp.

Just two minutes apart.
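A mismatch like this is easy to miss when two long tags differ only in their last two digits. Here is a minimal sketch of a comparison helper, assuming both image references are already available as strings and both include a tag. The function names are my own and not part of any tool.

```shell
#!/usr/bin/env bash
# Compare the image the registry holds with the image the pod is requesting.
# Assumes both references include a tag (repo/name:tag).

# Print the tag portion of an image reference (everything after the last colon).
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Return 0 when both references carry the same tag, 1 otherwise.
compare_image_tags() {
  local pushed requested
  pushed=$(image_tag "$1")
  requested=$(image_tag "$2")
  if [ "$pushed" = "$requested" ]; then
    echo "OK: tags match ($pushed)"
  else
    echo "MISMATCH: pushed=$pushed requested=$requested"
    return 1
  fi
}

# The two references from this incident.
compare_image_tags \
  "docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-52" \
  "docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-54" || true
```

The requested image can be read straight from the pod spec, for example with kubectl get pod api-service-staging-xxx -n staging -o jsonpath='{.spec.containers[*].image}'.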

Step 3 - Follow the Pipeline

The GitHub Actions logs finally explained the mismatch.

Build

Generated image tag:

a1b2c3d-2026-05-24-17-52

Push

Successfully pushed:

a1b2c3d-2026-05-24-17-52

Deploy

Deploying image:

a1b2c3d-2026-05-24-17-54

There it was.

The deployment was trying to pull an image that had never been pushed.

The Root Cause

The issue wasn't Docker.

It wasn't Kubernetes.

It was the GitHub Actions workflow.

The build job generated the image tag.

build:
  outputs:
    image-tag: ${{ steps.meta.outputs.tag }}
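The body of the meta step isn't shown in this post. As a rough sketch of what such a step might run, assuming the tag format matches the one used by the deploy fallback further down (short commit SHA plus a minute-resolution timestamp), the tag is computed once and appended to the file that GitHub Actions exposes as GITHUB_OUTPUT. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical body of the "meta" step: compute the tag once and publish it.

# Build a tag of the form <short-sha>-<YYYY-MM-DD-HH-MM>.
make_image_tag() {
  local sha="$1" stamp="$2"
  printf '%s-%s\n' "$sha" "$stamp"
}

# Append "tag=<value>" to the file GitHub Actions reads step outputs from.
# Aborts if GITHUB_OUTPUT is not set, for example when run outside a workflow.
publish_tag() {
  local tag="$1"
  echo "tag=${tag}" >> "${GITHUB_OUTPUT:?GITHUB_OUTPUT is not set}"
}

# In the workflow this would be driven by git and date, for example:
#   publish_tag "$(make_image_tag "$(git rev-parse --short HEAD)" "$(date +%F-%H-%M)")"
```

Because the timestamp is part of the tag, any later stage that recomputes it instead of reading this output is very likely to end up with a different value.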

The push job correctly consumed that output.

push:
  needs: build

But the deploy job only depended on push, while still trying to access build outputs.

deploy:
  needs: push

  env:
    IMAGE_TAG: ${{ needs.build.outputs.image-tag }}

Since build wasn't listed in deploy's needs, the needs context had no entry for it, and GitHub Actions evaluated the expression to an empty string.

No warning.

No error.

Just an empty variable.

The Silent Failure

The deployment script had a fallback.

if [ -z "${IMAGE_TAG:-}" ]; then
    export IMAGE_TAG=$(git rev-parse --short HEAD)-$(date +%F-%H-%M)
fi

At first glance, it looks like defensive programming.

In reality, it hid the configuration mistake.

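Instead of failing immediately, the deployment generated a brand new image tag using the current timestamp.

The sequence looked like this:

Build generated a1b2c3d-2026-05-24-17-52
Push uploaded a1b2c3d-2026-05-24-17-52
Deploy received an empty IMAGE_TAG
The fallback generated a1b2c3d-2026-05-24-17-54
Kubernetes tried to pull a1b2c3d-2026-05-24-17-54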

Kubernetes wasn't trying to pull the image that existed.

It was trying to pull an image that had never been built.
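One extra guard against this class of failure is to confirm that the registry can resolve the image before handing it to Kubernetes. This is a sketch of that idea, not what the original pipeline did. It assumes a deploy script with the Docker CLI available, and the helper name require_image is made up for illustration.

```shell
#!/usr/bin/env bash
# Refuse to deploy an image reference that the registry cannot resolve.
# Usage: require_image <image-ref> <checker command...>
# The checker is invoked with the image reference as its final argument.
require_image() {
  local ref="$1"
  shift
  if "$@" "$ref" >/dev/null 2>&1; then
    echo "Image found: $ref"
  else
    echo "ERROR: image not found in registry: $ref" >&2
    return 1
  fi
}

# Example call in a deploy script (needs the Docker CLI and registry access):
#   require_image "docker.io/mycompany/api-service:${IMAGE_TAG}" docker manifest inspect || exit 1
```

Passing the checker as arguments keeps the function independent of one registry client, so it can be exercised locally with true or false in place of docker manifest inspect.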

The Fix

The solution turned out to be surprisingly small.

First, I re-exported the build output.

push:
  needs: build

  outputs:
    image-tag: ${{ needs.build.outputs.image-tag }}

Then I updated the deploy job to consume the output from push.

deploy:
  needs: [build, push]

  env:
    IMAGE_TAG: ${{ needs.push.outputs.image-tag }}

Finally, I removed the silent fallback and added validation.

if [ -z "${IMAGE_TAG:-}" ]; then
    echo "ERROR: IMAGE_TAG was not provided." >&2
    exit 1
fi

Now the pipeline fails immediately instead of silently deploying the wrong image.

What Changed Afterwards

This wasn't the most complicated bug I've ever debugged.

But it permanently changed how I think about CI/CD pipelines.

Since then, every pipeline I build follows a few simple rules.

1. Pass data explicitly

If one job produces important data, every job that needs it should declare that dependency and consume the data explicitly.

Don't rely on assumptions.
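The same idea applies inside scripts. A small sketch, assuming a deploy helper that takes the tag as a required argument instead of reading whatever happens to be in the environment; build_image_ref is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical deploy helper: repository and tag are required positional
# arguments, so the caller has to hand them over explicitly.
# ${N:?message} aborts the (non-interactive) shell when argument N is unset or empty.
build_image_ref() {
  local repo="${1:?usage: build_image_ref <repo> <tag>}"
  local tag="${2:?usage: build_image_ref <repo> <tag>}"
  printf '%s:%s\n' "$repo" "$tag"
}

build_image_ref "docker.io/mycompany/api-service" "a1b2c3d-2026-05-24-17-52"
```

With this shape there is no code path that invents a tag on the caller's behalf.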

2. Validate critical inputs

Every deployment now checks that required variables exist before doing anything.

Missing data should stop the pipeline immediately.
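A minimal sketch of such a check, assuming a bash deploy script. It reports every missing variable in one pass instead of stopping at the first. The variable names in the usage comment are examples only.

```shell
#!/usr/bin/env bash
# Fail before any deployment work starts if a required variable is unset or empty.
require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: the value of the variable named
    # by $name, or an empty string when that variable is unset.
    if [ -z "${!name:-}" ]; then
      echo "ERROR: required variable $name is unset or empty." >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical use at the top of a deploy script (variable names are examples):
#   require_vars IMAGE_TAG KUBE_NAMESPACE || exit 1
```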

3. Remove silent fallbacks

Fallback logic often hides configuration mistakes.

It's usually better to fail fast than continue with incorrect data.

4. Enable strict shell mode

set -euo pipefail

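This small change catches unset variables and failed commands before they become production issues. It does not catch a variable that is set but empty, which is what the workflow passed in here, so the explicit IMAGE_TAG check is still needed.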
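The difference between an unset variable and an empty one is easy to check locally. A small sketch, assuming bash and env are on the PATH. Each case runs in a child shell so a failure doesn't stop the demonstration.

```shell
#!/usr/bin/env bash
# Show what strict mode does with an unset variable versus an empty one.

snippet='set -euo pipefail; echo "deploying tag=[${IMAGE_TAG}]"'

# Case 1: IMAGE_TAG is not defined at all. set -u aborts the child shell.
unset_result=$(env -u IMAGE_TAG bash -c "$snippet" 2>/dev/null || echo "aborted")

# Case 2: IMAGE_TAG is defined but empty, as in the failing workflow.
# set -u has nothing to complain about, so the script carries on.
empty_result=$(IMAGE_TAG="" bash -c "$snippet" 2>/dev/null || echo "aborted")

echo "unset: $unset_result"
echo "empty: $empty_result"
```

The first case aborts and the second keeps going with an empty tag, so strict mode works alongside an explicit emptiness check and doesn't replace it.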

Lessons Learned

I spent several minutes investigating what looked like a Kubernetes problem.

In reality, Kubernetes was doing exactly what it had been told to do.

It was trying to pull an image that didn't exist.

The registry was healthy.

The cluster was healthy.

The mistake wasn't in the infrastructure.

It was in the automation feeding the infrastructure.

Just a few characters in a YAML file.

- needs: push
+ needs: [build, push]

That small change fixed the deployment.

More importantly, it reinforced an engineering principle I'll continue to follow:

Automation should fail loudly, not silently.

Key Takeaways

Make dependencies explicit.
Pass outputs deliberately between jobs.
Validate every critical variable.
Prefer failing fast over silent recovery.
Log the values your pipeline receives.
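For the last takeaway, here is a minimal sketch of an input logger for the top of a deploy script. The helper name log_inputs and the DEPLOY_ENV variable are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Print each named variable with its value so the job log shows what was received.
# Unset or empty variables are shown as <empty> so they stand out.
log_inputs() {
  local name
  for name in "$@"; do
    echo "input ${name}=${!name:-<empty>}"
  done
}

IMAGE_TAG="a1b2c3d-2026-05-24-17-52"
DEPLOY_ENV=""   # hypothetical second input, left empty on purpose
log_inputs IMAGE_TAG DEPLOY_ENV
```

A line such as input IMAGE_TAG=<empty> in the deploy log would have pointed at the workflow right away. Only pass non-secret variable names to a helper like this.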

The infrastructure can be as complex as it needs to be.

The interfaces between pipeline stages should never be.

Final Thoughts

The more distributed our systems become, the more important the interfaces between them become.

Kubernetes wasn't the problem.

GitHub Actions wasn't the problem.

The problem was an assumption that data would magically appear where it was needed.

Now every pipeline I build follows one simple principle:

Make data flow explicit. Validate it. Fail fast if it's missing.

Small habits like these prevent surprisingly large production issues.

Thanks for reading Engineering Debug Diaries #1.

If you've ever tracked down a production issue that turned out to have a surprisingly simple root cause, I'd love to hear about it in the comments.

Happy debugging!
