I tested my autonomous content pipeline six times and found nine bugs.
The model caused exactly zero of them.
Every single failure was in the harness -- the environment around the model. This post walks through all nine, what caused them, and the one fix that retired most of them at once.
What I built
I wired up an autonomous content pipeline with Claude Code. Three independent AI sessions chain together:
- Observer -- scans the landscape (trending topics, competitor articles, performance data)
- Strategist -- picks the topic, decides the angle, writes an outline
- Marketer -- writes the full article, runs quality checks, schedules publication
Each phase is a separate Claude session. The Observer's output becomes the Strategist's input. The Strategist's output becomes the Marketer's input. No human in the loop unless something fails a quality check.
I drew this on a napkin and felt like a genius. On paper it was perfect.
```yaml
# The target architecture
observer:
  schedule: "0 7 * * 1"   # Monday 07:00
strategist:
  after: observer          # starts when Observer completes
marketer:
  after: strategist        # starts when Strategist completes
```
Sounds clean. Reality was messier.
The 9 bugs
After six rounds of testing, I cataloged every failure. They fall into four categories.
Execution control (2 bugs)
Bug #1 -- Parallel execution conflict
The first version used three separate cron jobs, all set to the same time. The Strategist started before the Observer had finished writing its output, and the Marketer kicked off with no input to work from at all. Three people talking at once in a meeting. Nobody listening.
```yaml
# Before: all fire at once
observer: "0 7 * * 1"
strategist: "0 7 * * 1"
marketer: "0 7 * * 1"
```
The fix was switching from time-based scheduling to event-driven chaining with `after` dependencies.
Bug #2 -- Cron stagger races
Even after staggering the times (07:00, 07:30, 08:00), the Strategist sometimes took longer than 30 minutes. Race condition by design.
The real fix was the same: don't schedule by clock, schedule by completion.
Data integrity (3 bugs)
Bug #3 -- Topic duplication
Without an exclusion list, the pipeline kept selecting the same topic. The Observer saw "LLMO" trending and picked it every single time.
```python
# Fix: inject exclusion list before topic selection
existing = list_existing_articles()
prompt = f"""
Select a topic. Do NOT pick any of these (already published):
{existing}
"""
```
Bug #4 -- Calendar entry duplication
The pipeline registered calendar events without checking for an existing match. Run it twice, get two identical events.
Fix: delete matching entries before inserting.
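The delete-then-insert pattern makes the registration idempotent: run it twice, still get one event. A minimal sketch, using a plain list of dicts to stand in for whatever calendar API you actually call (the `upsert_event` name and event shape are my invention, not the pipeline's real code):

```python
def upsert_event(events, new_event):
    """Remove any event with the same title and date, then insert.

    `events` is a plain list of dicts standing in for a calendar
    backend; a real implementation would swap these list operations
    for the API's list/delete/insert calls.
    """
    remaining = [
        e for e in events
        if not (e["title"] == new_event["title"] and e["date"] == new_event["date"])
    ]
    remaining.append(new_event)
    return remaining
```

Because the matching delete always runs first, re-running the pipeline can never accumulate duplicates.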
Bug #5 -- Scheduling conflict with existing reservations
The auto-scheduler picked dates that already had articles scheduled. Two articles on the same day, zero on the next.
```python
# Fix: calculate available dates first
available = get_available_publish_dates(
    start=today,
    count=batch_size,
    existing=get_scheduled_dates(),
)
```
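One plausible implementation of that helper (a sketch under the assumption that one article per day is the rule, which the bug description implies): walk forward from the start date and skip anything already booked.

```python
from datetime import date, timedelta

def get_available_publish_dates(start, count, existing):
    """Walk forward day by day, skipping dates that already have a
    scheduled article, until `count` free slots are collected."""
    taken = set(existing)
    available, day = [], start
    while len(available) < count:
        if day not in taken:
            available.append(day)
        day += timedelta(days=1)
    return available
```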
Quality assurance (2 bugs)
Bug #6 -- Self-reported quality checks
The AI was checking its own work and always passing itself. "Is this article good?" "Yes, it's excellent." I had built the grading equivalent of a student marking their own homework. With a red pen. Giving themselves an A+.
Fix: run quality checks in a separate Claude session that has no memory of the writing session. Independent reviewer, not self-assessment.
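The mechanical part of that fix is making sure the reviewer session sees nothing but the finished article. A sketch (the prompt wording and JSON verdict schema are assumptions of mine; how you spawn the second Claude session, via CLI or API, is up to your setup):

```python
import json

def build_review_prompt(article_text):
    """The reviewer receives ONLY the finished article -- no outline,
    no writing-session history -- so it can't be anchored by the
    author's intent."""
    return (
        "You are reviewing an article you did not write.\n"
        'Return JSON: {"verdict": "pass" or "fail", "reasons": [...]}.\n\n'
        "ARTICLE:\n" + article_text
    )

def parse_verdict(reviewer_output):
    """Parse the reviewer session's JSON reply into a pass/fail bool."""
    result = json.loads(reviewer_output)
    return result["verdict"] == "pass"
```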
Bug #7 -- Missing wit check
The quality pipeline checked for AI slop vocabulary but didn't check for wit -- the human touch that makes writing engaging instead of merely competent.
Fix: a dedicated check requiring at least two instances of wit (self-deprecation, unexpected metaphors, deflation after grand statements).
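One way to make that check enforceable is to have the reviewer session return tagged instances and count them. A sketch, assuming a reviewer output schema I made up for illustration:

```python
import json

# The wit categories from the check; the JSON shape below is an
# assumed reviewer-output format, not anything Claude emits natively.
WIT_KINDS = {"self-deprecation", "unexpected-metaphor", "deflation"}

def wit_check(reviewer_json, minimum=2):
    """Pass only if the reviewer tagged at least `minimum` wit
    instances of a recognized kind, e.g.
    {"wit": [{"kind": "self-deprecation", "quote": "..."}]}."""
    instances = json.loads(reviewer_json).get("wit", [])
    valid = [w for w in instances if w.get("kind") in WIT_KINDS]
    return len(valid) >= minimum
```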
Infrastructure (2 bugs)
Bug #8 -- Bash syntax error from angle brackets
The prompt template contained <devto_id> as a placeholder. Bash interpreted < as input redirection and silently corrupted the command. No error -- just wrong output.
```bash
# Before: bash parses <devto_id> as redirection operators
echo Update article <devto_id> to published

# After: quote the string, or use a bracket-free placeholder
echo "Update article DEVTO_ID_PLACEHOLDER to published"
```
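When the command string is built programmatically (as in a pipeline like this), Python's `shlex.quote` handles the escaping for you, wrapping anything with shell-special characters in single quotes so bash sees a literal string:

```python
import shlex

article_ref = "<devto_id>"  # placeholder kept literal on purpose

# shlex.quote single-quotes the value, so bash treats < and > as
# ordinary characters instead of redirection operators.
cmd = "echo " + shlex.quote(f"Update article {article_ref} to published")
```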
Bug #9 -- `at` job duplication
The scheduler used `at` for timed publication but didn't check for existing jobs tied to the same article ID. Re-running the pipeline queued duplicate publish commands: yesterday's article would have gone out twice tomorrow.
Fix: delete matching `at` jobs before scheduling new ones.
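Since `at` doesn't let you tag jobs, one way to match jobs to articles is a small registry persisted between runs. A sketch (the registry approach and function name are mine, not the pipeline's actual code; returning the commands instead of executing them keeps the logic testable without `at` installed):

```python
def dedupe_and_schedule(registry, article_id, at_time):
    """Build the atrm/at invocations for one article.

    `registry` maps article_id -> previously queued at-job number and
    is persisted between runs (e.g. a JSON file). A real run would
    execute these commands, pipe the publish command to `at`'s stdin,
    and parse the new job number from `at`'s output back into the
    registry.
    """
    cmds = []
    if article_id in registry:
        cmds.append(["atrm", str(registry[article_id])])  # drop the stale job first
    cmds.append(["at", at_time])
    return cmds
```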
The pattern
None of these bugs are about the model generating bad text. The model was fine. What failed was everything around the model:
| Category | Count | Example |
|---|---|---|
| Execution control | 2 | Parallel sessions, race conditions |
| Data integrity | 3 | Duplicates, conflicts, missing exclusions |
| Quality assurance | 2 | Self-grading, missing checks |
| Infrastructure | 2 | Shell escaping, job management |
This maps cleanly to the Prompt -> Context -> Harness progression that's emerging in AI engineering:
- Prompt engineering -- optimizing what you say to the model
- Context engineering -- optimizing everything you send to the model (RAG, tools, memory)
- Harness engineering -- optimizing the environment the model operates in
All nine of my bugs were harness bugs. Y Combinator's data backs this up: roughly 40% of AI agent projects fail, and the common thread isn't model quality. It's harness quality. The same pattern shows up in nearly every public agent post-mortem I've read this year -- the eval suite was missing, the queue was racy, the retry logic was self-destructive. The model was fine.
The single fix that retired half the list
The most impactful change was moving from time-based cron to event-driven dependencies.
```yaml
# Final architecture
observer:
  schedule: "0 7 * * 1"
strategist:
  after: observer
marketer:
  after: strategist
```
Each phase writes its output to a known location. The next phase only starts when the previous one completes successfully. If any phase fails, the chain stops -- no downstream corruption.
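The file-handoff chain described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual runner: the phase script names are hypothetical, and each script is assumed to take its predecessor's output path as `argv[1]` and write its own `<phase>.out` file on success.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical phase scripts, run strictly in order.
PHASES = ["observer", "strategist", "marketer"]

def run_chain(workdir):
    """Run phases sequentially, handing off via files. A failed or
    output-less phase stops the chain, so downstream phases never
    consume a partial artifact."""
    workdir = Path(workdir)
    prev_output = None
    for phase in PHASES:
        out = workdir / f"{phase}.out"
        args = [sys.executable, str(workdir / f"{phase}.py")]
        if prev_output is not None:
            args.append(str(prev_output))
        result = subprocess.run(args)
        if result.returncode != 0 or not out.exists():
            raise RuntimeError(f"{phase} failed; chain stopped")
        prev_output = out
```

The key property is the early `raise`: a failure in any phase halts everything downstream instead of letting it run on garbage input.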
After implementing all nine fixes, the seventh test run produced five articles in a single batch, automatically scheduled to non-conflicting dates, each independently quality-checked. That run is, embarrassingly, the first one I trusted enough to actually look at the output of.
Takeaway
AI agent quality is determined outside the AI.
The model is the chef. The context is the ingredients. The harness is the kitchen.
If the kitchen is broken -- wrong burners firing simultaneously, ingredients getting mixed up, no one tasting the food -- it doesn't matter how talented the chef is.
I spent three hours optimizing my prompts. I spent zero minutes checking my kitchen. Turns out, I was bug #10.
Before you optimize your prompts, check your kitchen.
Want the full harness playbook? The patterns in this article -- event-driven chaining, external evaluators, exclusion lists, postflight checks -- are part of a larger framework I wrote about in Harness Engineering: From Using AI to Controlling AI.