Ken Imoto

Originally published at kenimoto.dev

I Tested My AI Pipeline 6 Times and Found 9 Bugs. The Model Caused Zero of Them.

I tested my autonomous content pipeline six times and found nine bugs.

The model caused exactly zero of them.

Every single failure was in the harness -- the environment around the model. This post walks through all nine, what caused them, and the one fix that retired most of them at once.

What I built

I wired up an autonomous content pipeline with Claude Code. Three independent AI sessions chain together:

  1. Observer -- scans the landscape (trending topics, competitor articles, performance data)
  2. Strategist -- picks the topic, decides the angle, writes an outline
  3. Marketer -- writes the full article, runs quality checks, schedules publication

Each phase is a separate Claude session. The Observer's output becomes the Strategist's input. The Strategist's output becomes the Marketer's input. No human in the loop unless something fails a quality check.

I drew this on a napkin and felt like a genius. On paper it was perfect.

# The target architecture
observer:
  schedule: "0 7 * * 1"  # Monday 07:00
strategist:
  after: observer        # Starts when Observer completes
marketer:
  after: strategist      # Starts when Strategist completes

Sounds clean. Reality was messier.

The 9 bugs

After six rounds of testing, I cataloged every failure. They fall into four categories.

Execution control (2 bugs)

Bug #1 -- Parallel execution conflict

The first version used three separate cron jobs, all set to the same time. The Observer hadn't finished writing its output when the Strategist started reading, and the Marketer launched with no input to work from at all. Three people talking at once in a meeting. Nobody listening.

# Before: all fire at once
observer:   "0 7 * * 1"
strategist: "0 7 * * 1"
marketer:   "0 7 * * 1"

The fix was switching from time-based scheduling to event-driven chaining with after dependencies.

Bug #2 -- Cron stagger races

Even after staggering the times (07:00, 07:30, 08:00), the Strategist sometimes took longer than 30 minutes. Race condition by design.

The real fix was the same: don't schedule by clock, schedule by completion.
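
As a sketch of what completion-based chaining looks like in practice -- assuming each phase is a shell script that invokes its Claude session and writes output to a known path (the script names here are hypothetical stand-ins, not the pipeline's real commands):

from pathlib import Path
import subprocess

PIPELINE_DIR = Path("pipeline")

# Hypothetical stand-ins for the real Claude Code invocations.
PHASES = [
    ("observer", "./run_observer.sh"),
    ("strategist", "./run_strategist.sh"),
    ("marketer", "./run_marketer.sh"),
]

def run_chain() -> None:
    """Run phases strictly in order; stop on the first failure."""
    prev_output = None
    for name, script in PHASES:
        out_file = PIPELINE_DIR / f"{name}.md"
        cmd = [script, str(out_file)] + ([str(prev_output)] if prev_output else [])
        # check=True aborts the chain on a non-zero exit,
        # so a failed phase never feeds garbage downstream.
        subprocess.run(cmd, check=True)
        if not out_file.exists():
            raise RuntimeError(f"{name} finished without writing {out_file}")
        prev_output = out_file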

Data integrity (3 bugs)

Bug #3 -- Topic duplication

Without an exclusion list, the pipeline kept selecting the same topic. The Observer saw "LLMO" trending and picked it every single time.

# Fix: inject exclusion list before topic selection
existing = list_existing_articles()
prompt = f"""
Select a topic. Do NOT pick any of these (already published):
{existing}
"""
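
list_existing_articles does the heavy lifting there. It isn't shown above; one plausible implementation, assuming published articles sit in a directory as markdown files with a title: front-matter line:

from pathlib import Path

def list_existing_articles(articles_dir: str = "articles") -> list[str]:
    """Collect titles of already-published articles for the exclusion list."""
    titles = []
    for path in sorted(Path(articles_dir).glob("*.md")):
        for line in path.read_text().splitlines():
            if line.startswith("title:"):
                titles.append(line.removeprefix("title:").strip())
                break
    return titles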

Bug #4 -- Calendar entry duplication

The pipeline registered calendar events without checking for an existing match. Run it twice, get two identical events.

Fix: delete matching entries before inserting.
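
A minimal sketch of that delete-before-insert pattern, assuming a JSON file as the calendar store (the pipeline's actual calendar backend isn't shown):

import json
from pathlib import Path

CALENDAR = Path("calendar.json")

def upsert_event(event: dict) -> None:
    """Idempotent insert: drop any entry with the same article ID first."""
    events = json.loads(CALENDAR.read_text()) if CALENDAR.exists() else []
    events = [e for e in events if e.get("article_id") != event["article_id"]]
    events.append(event)
    CALENDAR.write_text(json.dumps(events, indent=2))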

Bug #5 -- Scheduling conflict with existing reservations

The auto-scheduler picked dates that already had articles scheduled. Two articles on the same day, zero on the next.

# Fix: calculate available dates first
available = get_available_publish_dates(
    start=today,
    count=batch_size,
    existing=get_scheduled_dates()
)
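
get_available_publish_dates is left undefined above. A minimal version walks forward from the start date and skips anything already booked:

from datetime import date, timedelta

def get_available_publish_dates(
    start: date, count: int, existing: set[date]
) -> list[date]:
    """Return the next `count` dates that have no article scheduled."""
    available: list[date] = []
    day = start
    while len(available) < count:
        if day not in existing:
            available.append(day)
        day += timedelta(days=1)
    return available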

Quality assurance (2 bugs)

Bug #6 -- Self-reported quality checks

The AI was checking its own work and always passing itself. "Is this article good?" "Yes, it's excellent." I had built the grading equivalent of a student marking their own homework. With a red pen. Giving themselves an A+.

Fix: run quality checks in a separate Claude session that has no memory of the writing session. Independent reviewer, not self-assessment.
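
A sketch of what that separation can look like, assuming Claude Code's non-interactive -p (print) mode -- each subprocess is a fresh session, so the reviewer genuinely has no memory of the writing run:

import subprocess

def independent_review(article_path: str) -> bool:
    """Run the quality check in a brand-new Claude session."""
    prompt = (
        "You are a strict editor reviewing someone else's article. "
        f"Read {article_path} and reply with PASS or FAIL, then your reasons."
    )
    # A new process is a new session: no shared context with the writer.
    result = subprocess.run(
        ["claude", "-p", prompt], capture_output=True, text=True, check=True
    )
    return result.stdout.strip().startswith("PASS")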

Bug #7 -- Missing wit check

The quality pipeline checked for AI slop vocabulary but didn't check for wit -- the human touch that makes writing engaging instead of merely competent.

Fix: a dedicated check requiring at least two instances of wit (self-deprecation, unexpected metaphors, deflation after grand statements).
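
One way to make that checkable rather than vibes-based: have the reviewer enumerate instances and emit a machine-readable count. The threshold of two comes from the fix above; the prompt wording and the COUNT convention are my own:

WIT_CHECK_PROMPT = """\
List every instance of wit in this article: self-deprecation,
unexpected metaphors, or deflation after a grand statement.
Output one line per instance, then a final line: COUNT: <n>
"""

def passes_wit_check(review_output: str, minimum: int = 2) -> bool:
    """Fail the article unless the reviewer found at least `minimum` instances."""
    for line in review_output.splitlines():
        if line.startswith("COUNT:"):
            return int(line.split(":", 1)[1]) >= minimum
    return False  # no COUNT line means the check itself misfired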

Infrastructure (2 bugs)

Bug #8 -- Bash syntax error from angle brackets

The prompt template contained <devto_id> as a placeholder. When the command ran unquoted, bash parsed < and > as redirection operators and silently corrupted it. No useful error -- just wrong output.

# Before: unquoted, so bash parses <devto_id> as redirections
echo Update article <devto_id> to published

# After: quote the string and use a bracket-free placeholder
echo "Update article DEVTO_ID_PLACEHOLDER to published"

Bug #9 -- at job duplication

The scheduler used at for timed publication but didn't check for existing jobs with the same article ID. Re-running the pipeline queued duplicate publish commands. Yesterday's article would have shown up twice tomorrow.

Fix: delete matching at jobs before scheduling new ones.
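
A sketch of delete-before-schedule for at, assuming each queued job embeds an ARTICLE_ID marker so it can be found again (atq lists pending jobs, at -c dumps a job's script, atrm removes one):

import subprocess

def remove_matching_at_jobs(article_id: str) -> None:
    """Remove any pending `at` jobs whose script mentions this article ID."""
    jobs = subprocess.run(["atq"], capture_output=True, text=True).stdout
    for line in jobs.splitlines():
        if not line.strip():
            continue
        job_id = line.split()[0]
        script = subprocess.run(
            ["at", "-c", job_id], capture_output=True, text=True
        ).stdout
        if f"ARTICLE_ID={article_id}" in script:
            subprocess.run(["atrm", job_id], check=True)

Schedule the replacement job with the same marker so the next run can find and retire it too.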

The pattern

None of these bugs are about the model generating bad text. The model was fine. What failed was everything around the model:

Category           Count  Example
Execution control  2      Parallel sessions, race conditions
Data integrity     3      Duplicates, conflicts, missing exclusions
Quality assurance  2      Self-grading, missing checks
Infrastructure     2      Shell escaping, job management

This maps cleanly to the Prompt -> Context -> Harness progression that's emerging in AI engineering:

  • Prompt engineering -- optimizing what you say to the model
  • Context engineering -- optimizing everything you send to the model (RAG, tools, memory)
  • Harness engineering -- optimizing the environment the model operates in

All nine of my bugs were harness bugs. Y Combinator's data backs this up: 40% of AI agent projects fail, and the common thread isn't model quality. It's harness quality. As of May 2026, the same pattern shows up in nearly every public agent post-mortem I've read: the eval suite was missing, the queue was racy, the retry logic was self-destructive. The model was fine.

The single fix that retired half the list

The most impactful change was moving from time-based cron to event-driven dependencies.

# Final architecture
observer:
  schedule: "0 7 * * 1"
strategist:
  after: observer
marketer:
  after: strategist

Each phase writes its output to a known location. The next phase only starts when the previous one completes successfully. If any phase fails, the chain stops -- no downstream corruption.

After implementing all nine fixes, the seventh test run produced five articles in a single batch, automatically scheduled to non-conflicting dates, each independently quality-checked. Embarrassingly, that was the first run whose output I trusted enough to actually read.

Takeaway

AI agent quality is determined outside the AI.

The model is the chef. The context is the ingredients. The harness is the kitchen.

If the kitchen is broken -- wrong burners firing simultaneously, ingredients getting mixed up, no one tasting the food -- it doesn't matter how talented the chef is.

I spent three hours optimizing my prompts. I spent zero minutes checking my kitchen. Turns out, I was bug #10.

Before you optimize your prompts, check your kitchen.


Want the full harness playbook? The patterns in this article -- event-driven chaining, external evaluators, exclusion lists, postflight checks -- are part of a larger framework I wrote about in Harness Engineering: From Using AI to Controlling AI.
