Adedamola Ajibola

Engineering Debug Diaries #1: The ImagePullBackOff That Wasn't Kubernetes

How a missing GitHub Actions output caused an ImagePullBackOff and the engineering lessons it taught me about building reliable CI/CD pipelines

Engineering Debug Diaries is a series where I document real debugging sessions, production incidents, and the engineering lessons they taught me.

Every article focuses on the investigation, the root cause, and the practical changes I made afterwards.

The Incident

A deployment failed with:

ImagePullBackOff

My first guesses were the obvious ones.

  • Docker Hub is unavailable.
  • Registry credentials expired.
  • Kubernetes can't pull the image.

So I checked Docker Hub.

The image was there.

docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-52

Exactly where I expected it.

At that point I had two conflicting facts.

  • The image existed.
  • Kubernetes couldn't pull it.

One of my assumptions had to be wrong.

Time to follow the evidence.

The Investigation

Step 1 - Verify the Registry

First, I ruled out Docker Hub.

  • Image exists
  • Registry credentials are valid
  • Build pipeline completed successfully

So the registry wasn't the problem.

Step 2 - Inspect the Pod

Next, I inspected the pod events.

kubectl describe pod api-service-staging-xxx -n staging

The output immediately caught my attention.

Failed to pull image

docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-54

Wait...

Docker Hub contained:

17:52

But Kubernetes was requesting:

17:54

Same commit.

Different timestamp.

Just two minutes apart.
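
The same comparison can be scripted instead of eyeballed. The sketch below is only an illustration, not part of the original pipeline: `image_tag` and `tags_match` are hypothetical helper names, and it assumes each image reference ends in `:<tag>` rather than a digest.

```shell
# Hypothetical helper: print the tag portion of an image reference.
# Assumes the reference ends in ":<tag>" (no digest form).
image_tag() {
    printf '%s\n' "${1##*:}"
}

# Hypothetical helper: succeed only when both references carry the same tag.
tags_match() {
    [ "$(image_tag "$1")" = "$(image_tag "$2")" ]
}

# Values taken from the incident: what was pushed vs. what the pod requested.
pushed="docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-52"
requested="docker.io/mycompany/api-service:a1b2c3d-2026-05-24-17-54"

if tags_match "$pushed" "$requested"; then
    echo "tags match"
else
    echo "tag mismatch: pushed=$(image_tag "$pushed") requested=$(image_tag "$requested")"
fi
```

In a real check, the requested reference would come from the pod spec rather than a hard-coded string.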

Step 3 - Follow the Pipeline

The GitHub Actions logs finally explained the mismatch.

Build

Generated image tag:

a1b2c3d-2026-05-24-17-52

Push

Successfully pushed:

a1b2c3d-2026-05-24-17-52

Deploy

Deploying image:

a1b2c3d-2026-05-24-17-54

There it was.

The deployment was trying to pull an image that had never been pushed.

The Root Cause

The issue wasn't Docker.

It wasn't Kubernetes.

It was the GitHub Actions workflow.

The build job generated the image tag.

build:
  outputs:
    image-tag: ${{ steps.meta.outputs.tag }}

The push job correctly consumed that output.

push:
  needs: build

But the deploy job only depended on push, while still trying to access build outputs.

deploy:
  needs: push

  env:
    IMAGE_TAG: ${{ needs.build.outputs.image-tag }}

Since build wasn't listed in the deploy job's needs, the needs context had no entry for it, and the expression evaluated to an empty string.

No warning.

No error.

Just an empty variable.

The Silent Failure

The deployment script had a fallback.

if [ -z "${IMAGE_TAG:-}" ]; then
    export IMAGE_TAG=$(git rev-parse --short HEAD)-$(date +%F-%H-%M)
fi

At first glance, it looks like defensive programming.

In reality, it hid the configuration mistake.

Instead of failing immediately, the deployment generated a brand-new image tag using the current time.

The sequence looked like this:

[Figure: deployment flow with failure state]

Kubernetes wasn't trying to pull the image that existed.

It was trying to pull an image that had never been built.
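
To make the drift concrete, here is a small simulation of that sequence. It is an illustrative sketch, not the real pipeline: `make_tag` and `resolve_tag_with_fallback` are hypothetical stand-ins for the tag logic, and the timestamps are passed in as fixed strings instead of calling `date`, so the result is reproducible.

```shell
# Hypothetical stand-in for the tag format: <short-sha>-<timestamp>.
make_tag() {
    printf '%s-%s\n' "$1" "$2"
}

# Mimics the old deploy behaviour: if the incoming tag is empty,
# quietly generate a new one from the sha and the current time.
# $1 = incoming tag, $2 = short sha, $3 = current timestamp.
resolve_tag_with_fallback() {
    if [ -z "$1" ]; then
        make_tag "$2" "$3"
    else
        printf '%s\n' "$1"
    fi
}

built=$(make_tag a1b2c3d 2026-05-24-17-52)   # build job runs at 17:52
received=""                                  # deploy receives an empty value
deployed=$(resolve_tag_with_fallback "$received" a1b2c3d 2026-05-24-17-54)

echo "pushed:   $built"
echo "deployed: $deployed"
```

The two printed tags share the commit hash but differ in the minute field, which is exactly the mismatch that showed up in the pod events.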

The Fix

The solution turned out to be surprisingly small.

First, I re-exported the build output from the push job.

push:
  needs: build

  outputs:
    image-tag: ${{ needs.build.outputs.image-tag }}

Then I updated the deploy job to consume the output from push.

deploy:
  needs: [build, push]

  env:
    IMAGE_TAG: ${{ needs.push.outputs.image-tag }}

Finally, I removed the silent fallback and added validation.

# ":-" keeps the check itself safe under "set -u" when IMAGE_TAG is unset.
if [ -z "${IMAGE_TAG:-}" ]; then
    echo "ERROR: IMAGE_TAG was not provided." >&2
    exit 1
fi

Now the pipeline fails immediately instead of silently deploying the wrong image.

What Changed Afterwards

This wasn't the most complicated bug I've ever debugged.

But it permanently changed how I think about CI/CD pipelines.

Since then, every pipeline I build follows a few simple rules.

1. Pass data explicitly

If one job produces important data, another job should explicitly consume it.

Don't rely on assumptions.

2. Validate critical inputs

Every deployment now checks that required variables exist before doing anything.

Missing data should stop the pipeline immediately.
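
When a deployment depends on more than one input, the same check can be wrapped in a helper. This is a minimal sketch under stated assumptions: `require_env` is a hypothetical function name, `DEPLOY_ENV` is an invented second input used only for the demo, and the variable names passed in are trusted, hard-coded identifiers (they go through `eval`).

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
# It reports every missing name before failing, so one run shows all gaps.
# The body runs in a subshell so its loop variables do not leak to the caller.
require_env() (
    missing=0
    for name in "$@"; do
        # Indirect lookup; only pass trusted, hard-coded variable names.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "ERROR: $name is not set or is empty." >&2
            missing=1
        fi
    done
    exit "$missing"   # exits only the subshell; becomes the function's status
)

IMAGE_TAG="a1b2c3d-2026-05-24-17-52"
DEPLOY_ENV=""   # hypothetical second input, left empty on purpose

if require_env IMAGE_TAG DEPLOY_ENV; then
    echo "all inputs present"
else
    echo "refusing to deploy"
fi
```

Calling it as the first line of a deploy script means nothing else runs until every required value has actually arrived.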

3. Remove silent fallbacks

Fallback logic often hides configuration mistakes.

It's usually better to fail fast than continue with incorrect data.

4. Enable strict shell mode

set -euo pipefail

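This small change catches unset variables before they become production issues. A variable that is set to an empty string still counts as set, so strict mode complements explicit validation rather than replacing it.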
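
The difference between unset and empty is easy to check directly. The sketch below assumes `bash` is on the PATH and runs two child shells with `-u`; the result variable names are made up for the demo.

```shell
# Case 1: IMAGE_TAG is unset. Under -u, expanding it aborts the child shell.
if (unset IMAGE_TAG; bash -u -c 'echo "tag=[$IMAGE_TAG]"') >/dev/null 2>&1; then
    unset_result="ran"
else
    unset_result="aborted"
fi

# Case 2: IMAGE_TAG is set but empty, which is likely what a job sees when
# a workflow expression evaluates to an empty string. -u does not object.
if IMAGE_TAG="" bash -u -c 'echo "tag=[$IMAGE_TAG]"' >/dev/null 2>&1; then
    empty_result="ran"
else
    empty_result="aborted"
fi

echo "unset: $unset_result"
echo "empty: $empty_result"
```

Only the unset case aborts, which is why an explicit emptiness check is still worth keeping alongside strict mode.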

Lessons Learned

I spent several minutes investigating what looked like a Kubernetes problem.

In reality, Kubernetes was doing exactly what it had been told to do.

It was trying to pull an image that didn't exist.

The registry was healthy.

The cluster was healthy.

The mistake wasn't in the infrastructure.

It was in the automation feeding the infrastructure.

Just a few characters in a YAML file.

- needs: push
+ needs: [build, push]

That small change fixed the deployment.

More importantly, it reinforced an engineering principle I'll continue to follow:

Automation should fail loudly, not silently.

Key Takeaways

  • Make dependencies explicit.
  • Pass outputs deliberately between jobs.
  • Validate every critical variable.
  • Prefer failing fast over silent recovery.
  • Log the values your pipeline receives.
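
For the last point, a few lines at the top of a deploy script are enough. This is a sketch, not the original script: `log_inputs` is a hypothetical helper, `DEPLOY_ENV` is an invented input for the demo, and it should only be given names of non-secret variables.

```shell
# Hypothetical helper: print each named variable and its value to stderr,
# marking empty or unset ones so they stand out in the job log.
# The body runs in a subshell so its loop variables do not leak to the caller.
log_inputs() (
    for name in "$@"; do
        # Indirect lookup; only pass trusted, hard-coded variable names.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "input $name=$value" >&2
        else
            echo "input $name=<empty>" >&2
        fi
    done
)

IMAGE_TAG="a1b2c3d-2026-05-24-17-52"
DEPLOY_ENV=""   # hypothetical input, empty to show the marker

log_inputs IMAGE_TAG DEPLOY_ENV
```

Had the deploy job logged its inputs this way, the empty tag would have been visible in the job log before anything reached the cluster.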

The infrastructure can be as complex as it needs to be.

The interfaces between pipeline stages should never be.

Final Thoughts

The more distributed our systems become, the more important the interfaces between them become.

Kubernetes wasn't the problem.

GitHub Actions wasn't the problem.

The problem was an assumption that data would magically appear where it was needed.

Now every pipeline I build follows one simple principle:

Make data flow explicit. Validate it. Fail fast if it's missing.

Small habits like these prevent surprisingly large production issues.

Thanks for reading Engineering Debug Diaries #1.

If you've ever tracked down a production issue that turned out to have a surprisingly simple root cause, I'd love to hear about it in the comments.

Happy debugging!
