Jean-Rodney Larrieux

How I Spent My Vacation Day: Debugging Docker with AI (And Recovering 39GB)

Should have been poolside. Ended up in a terminal. Best vacation day ever.

Yesterday morning, sipping my first coffee and checking my home server dashboard, I saw it: 89% disk usage on my 512GB SSD. My production Nomad cluster with 173 containers was slowly choking to death.

I told myself "just a quick look." Six hours later, I'd solved a mystery that would have taken me days without my AI debugging partner.

The Mystery That Hooked Me 🕵️

The numbers didn't add up (see the quick check after this list):

  • Docker reporting 99GB of space used
  • Actual images: only 13GB
  • Where was the missing 86GB?
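
If you want to reproduce that comparison yourself, here's roughly how to see the gap (a sketch that assumes Docker's default data root of /var/lib/docker):

docker system df                        # what Docker thinks it's using
docker images                           # the images actually present
sudo du -sh /var/lib/docker/overlay2    # what the filesystem really holds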

This wasn't just about disk space. With 25 replicas each of my core services (ferengi, price-service, formatter), something was fundamentally wrong with my container architecture.

Enter My AI Debugging Partner 🤖

Instead of random Googling, I structured this as a systematic investigation with Claude. The collaboration pattern that emerged was powerful:

Me: "Here's what I'm seeing..." (intuition, context, constraints)

AI: "Let's check X, Y, Z systematically..." (comprehensive analysis, documentation)

Me: "That's weird, but this makes sense..." (creative leaps, real-world experience)

We built diagnostic commands step by step, creating a repeatable investigation process.
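
Condensed, the opening triage looked something like this (a rough sketch, not the actual script; it assumes a stock Docker install):

df -h                     # which filesystem is actually under pressure
docker system df -v       # per-image, per-container, and per-volume breakdown
docker ps --size          # each container's writable-layer usage
docker volume ls          # orphaned volumes are a classic space sink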

Down the Rabbit Hole 🐰

Hour 2: Found the smoking gun - 482 overlay2 directories vs 173 running containers. Docker had accumulated massive "ghost layers" from months of deployments.
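
The check itself was a simple count comparison (note the filter: overlay2 contains an "l" directory of symlinks that isn't a layer). Some gap is expected, since every image is built from several layers - but not a gap like this:

sudo ls -1 /var/lib/docker/overlay2 | grep -v '^l$' | wc -l   # layer dirs on disk
docker ps -q | wc -l                                          # running containers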

Hour 3: Built a cleanup script, recovered 93GB. Victory! 🎉
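
The script itself is environment-specific, but its safe core was standard prune commands along these lines (these flags delete data - test on a non-production box first):

docker container prune -f   # drop stopped containers
docker image prune -af      # drop images not referenced by any container
docker builder prune -af    # drop build cache
docker volume prune -f      # drop unreferenced volumes (this one can eat data)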

Hour 4: Plot twist - everything broke. CNI networking conflicts, ECR authentication failures, Docker metadata corruption.

Hours 5-6: "This should have been simple..." 😅

Each new problem became a mini-learning session. Docker internals, CNI plugins, Nomad networking - concepts I'd used but never deeply understood were now crystal clear through systematic AI-assisted debugging.
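
For the record, the recovery steps boiled down to commands like these - a hedged sketch, since the CNI path assumes the host-local IPAM plugin, and the region and registry URL are placeholders:

sudo rm -rf /var/lib/cni/networks/*   # clear stale CNI IP allocations
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com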

The Real Breakthrough 💡

The infrastructure cleanup was only half the story. Even after recovery, my images were still 800MB+ each - multiply that by 75+ replicas and you're right back where you started.

The aha moment: Attack the problem at TWO levels:

  1. Infrastructure: Clean ghost layers and corrupted state
  2. Application: Optimize the images themselves

Multi-Stage Docker Builds to the Rescue 🏗️

My old Dockerfile was a disaster:

FROM python:3.10-slim
# Install EVERYTHING, including build tools
RUN apt-get update && apt-get install -y gcc build-essential git ...
# Build tools live forever in the final image

New approach:

# Stage 1: Builder (with all the messy build tools)
FROM python:3.10-bookworm AS builder
# Do all the building, installing, compiling...

# Stage 2: Clean runtime (minimal dependencies)
FROM python:3.10-slim
COPY --from=builder /app /app
# Only copy what you need to RUN, not BUILD

The math was beautiful: 500MB saved per image × layer sharing across replicas = massive space recovery.
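
You don't have to take the layer sharing on faith - Docker will show it (the image name here is a placeholder):

docker history my-service:latest   # per-layer sizes, including the shared base layers
docker system df -v                # the SHARED SIZE column shows reuse across images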

Victory Metrics 📊

Final result: 99GB → 60GB (39GB recovered)

  • System health: 89% → 58% disk usage
  • Deployment efficiency: Images 60% smaller
  • Security win: No build tools or secrets in production images
  • Knowledge gained: Priceless

Key Takeaways for Fellow Engineers 🎯

1. The Two-Pronged Approach

Don't just fix infrastructure OR application - optimize both layers.

2. AI as Debugging Partner

LLMs excel at systematic analysis and comprehensive checklists. Humans bring intuition and creative problem-solving. Together? Unstoppable.

3. Multi-Stage Builds Aren't Optional

If you're not using them, you're shipping build tools to production. Every. Single. Time.

4. Systems Thinking Matters

Local optimizations (cleaning up space) can miss bigger architectural issues (bloated images).

5. Document Your Wins

I turned this debugging session into a repeatable script. Future me will thank present me.

The Human-AI Collaboration That Worked 🤝

What made this effective wasn't replacing human insight with AI - it was amplifying it:

  • AI strength: Systematic analysis, comprehensive command sequences, documentation
  • Human strength: Pattern recognition, creative leaps, understanding constraints
  • Together: Accelerated learning and problem-solving

Instead of days of trial-and-error, I compressed months of Docker expertise into hours of focused collaboration.

The Bottom Line 💪

Best vacation day debugging session ever. Sometimes the most rewarding problems find you when you least expect them.

Who else has had their "relaxing day off" hijacked by a fascinating technical challenge? Share your stories below! 👇


P.S. - The cleanup script is now version-controlled and includes automatic backup rotation, CNI state cleanup, and ECR re-authentication. Because future-me deserves nice things too.
