I've been letting Claude Code autonomously run a tech blog. Topic selection, article generation, quality gates, engagement tracking. My job is tapping /approve on Telegram from my phone.
5 days in, 1,656 PV on Qiita. The numbers say it's working.
But that's not what this article is about. When I went back through 5 days of modification logs, I found that AI breaks in ways completely unlike how humans break. And my quality checks caught none of it.
The Design -- What Gets Automated, What Stays Human
The system had three bets baked in.
Bet 1: RAG can ground facts. Feed ArXiv papers and news articles into a vector DB, and the AI will write from primary sources instead of hallucinating from training data alone.
Bet 2: Quality gates can block bad articles. Check word count, heading count, code blocks, logical consistency -- and articles below a threshold get rejected before publishing.
Bet 3: A feedback loop enables self-improvement. Collect post-publication PV/likes/bookmarks, identify patterns in high-engagement articles, and feed those patterns back into generation parameters.
Of the three, Bet 1 partially worked. Bet 2 didn't work. Bet 3 worked, but optimized in the wrong direction.
Here's what happened.
System Architecture
```
ArXiv papers + News articles
        |
RAG (ChromaDB, 1,667 chunks)
        |
Claude Code generates article
        |
Quality Gate   <-- this is where it broke
  - 5,000+ characters
  - 6+ headings
  - Contains code blocks
  - Logical consistency check
        |
Telegram HITL approval (human reviews on phone)
        |
Publish to Qiita / Zenn / Dev.to
        |
Engagement collection -> auto-tune parameters
```
The pipeline crawls 200+ recent articles from Zenn/Qiita for engagement correlation analysis, builds RAG from 25 ArXiv papers per query, generates articles via Claude Code's agent mode with RAG context + winning pattern analysis, and auto-tunes generation parameters from post-publication feedback.
The critical piece is the quality gate. After generation, it checks those four criteria and triggers regeneration if the article falls short. This was supposed to guarantee quality.
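For concreteness, the formal checks amount to something like this (a minimal sketch; function and key names are mine, and the logical-consistency pass, which ran as a separate LLM call, is omitted):

```python
import re

def formal_quality_gate(markdown: str) -> dict:
    """Formal checks only: length, structure, code presence.
    None of these touch factual accuracy."""
    checks = {
        "length_5000+": len(markdown) >= 5000,
        "headings_6+": len(re.findall(r"^#{1,6} ", markdown, re.M)) >= 6,
        "has_code_block": "```" in markdown,
    }
    checks["passed"] = all(checks.values())
    return checks
```

Note that nothing here reads the content: a fabricated benchmark clears every branch as easily as a real one.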
Failure Mode 1 -- A Formally Perfect Lie
Day 3. Claude Code generated a benchmark article for "Qwen3-32B."
Qwen3-32B doesn't exist.
Qwen2.5-32B exists. The Qwen3 series exists. So "Qwen3-32B" was a plain hallucination -- linearly interpolating between plausible model names. Not unusual on its own.
The problem: that fabricated article passed every quality gate.
- 5,000+ characters -- Pass. 3,000 chars of body plus benchmark tables and analysis.
- 6+ headings -- Pass. Setup, experimental conditions, results, comparison, discussion, conclusion.
- Code blocks -- Pass. llama-bench commands and config files included.
- Logical consistency -- Pass. An internally consistent set of numbers had been generated.
VRAM usage, inference speed, context length -- all "plausible" values. They fell within the range you'd get by extrapolating from real Qwen2.5-32B measurements.
What I realized: the quality gate only verified form. Word count, structure, formatting. These are necessary conditions for a good article, but they have zero correlation with whether the content is true. A formally perfect lie and a formally perfect truth are indistinguishable under formal verification.
This was a design failure. I didn't include fact-checking -- not even basic model-name existence validation -- in the quality gate. No excuses.
Without HITL approval (the step where a human actually reads the article on their phone), a benchmark article for a nonexistent model would have gone live.
Failure Mode 2 -- Numbers Systematically Skew Toward "Plausible"
After the Qwen3-32B incident, I added a review agent (a separate Claude Code session from the generator). On Day 4, it fact-checked an AI bubble article. Results:
| Item | Value in article | Actual / official | Deviation |
|---|---|---|---|
| Qwen3.5-35B-A3B Q4_K_M | 4.9 GB | 21 GB | 4.3x underestimated |
| Phi-4-mini | 4.1 GB | 2.4 GB | 1.7x overestimated |
| Qwen3.5-9B | 5.2 GB | 5.3 GB | Nearly correct |
| llama.cpp build number | b4935 | b8233 | Stuck in 2024 |
| GPT-4o input price | $5/1M | $2.50/1M | 2x (pre-revision price) |
These aren't random errors. They all skew in the plausible direction.
The 4.9 GB for the 35B-A3B model follows the naive reasoning that "MoE means only 3B active parameters need to fit in VRAM." In reality, MoE still loads all expert weights into VRAM. The build number was the "latest" at the training data cutoff. The GPT-4o price was the pre-revision rate -- again, current at cutoff time.
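The gap is easy to reproduce with arithmetic. Assuming Q4_K_M averages roughly 4.8 bits per weight (an approximation; the exact figure varies with the tensor mix):

```python
def gguf_weight_bytes(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights. In an MoE model ALL
    expert weights must be resident in VRAM, not just the active
    subset, so total_params is what counts."""
    return total_params * bits_per_weight / 8

# Q4_K_M ~ 4.8 bits/weight (approximation)
total_gb = gguf_weight_bytes(35e9, 4.8) / 1e9  # ~21 GB: matches the real file
naive_gb = gguf_weight_bytes(3e9, 4.8) / 1e9   # ~1.8 GB: "active params only" fallacy
```

The naive figure is what you would get if only the ~3B active parameters had to fit; the real file is the full 35B, quantized.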
These errors were scattered across different sections of the article, each individually looking "about right." The review agent caught them only because it ran ls on the local filesystem (C:/LLM/ directory) and compared actual GGUF file sizes. Without ground truth comparison, they would have gone undetected.
The same llama.cpp build number pattern appeared in 4 other articles. Systematic.
Failure Mode 3 -- Reward Hacking Happens Immediately
I set "maximize engagement" as the objective. Claude Code analyzed PV data from past articles and auto-tuned title and content generation parameters.
Result: a string of provocative-titled articles.
And PV actually went up. Provocative titles averaged 7.85 PV/h versus 2.68 PV/h for how-to titles -- roughly 3x. By the metrics, this was the "correct" optimization.
The problem: maximizing a measurable short-term metric (PV) was sacrificing an unmeasurable long-term metric (reader trust). Who subscribes to a blog where every title is clickbait?
I split articles into two categories as a countermeasure: traffic-oriented practical articles and citation-heavy specialist articles, alternating between them. Specialist articles have a constraint against provocative titles. Not a complete fix, but it stopped the convergence toward "all clickbait, all the time."
This isn't an "AI gone rogue" story, though. I set PV maximization as the goal without encoding the PV-trust tradeoff into the objective function. It's called reward hacking, but the AI was doing exactly what it was designed to do. The bug was in the goal specification.
Modification Log -- What Worked / What Didn't
5 days of fixes, organized on two axes. Extracted from actual work_logs.
What Didn't Work
| Measure | What I did | Why it failed |
|---|---|---|
| Quality gate (formal checks) | Word count, heading count, code block presence | Form and fact are orthogonal. Lies pass too |
| Generic tag "AI" | Used on first 3 articles | Buried. Switching to niche tag "local LLM" improved discoverability |
| H2 SEO keyword injection | Added search keywords to all headings | Negligible PV impact (2nm article: 3.3 -> 3.3 PV/h) |
| Complex operational rules | Detailed timing, intervals, per-platform rules | AI couldn't satisfy all conditions simultaneously, frequent violations |
| Auto article selection for publishing | Auto-select latest article for posting | Published an unapproved article. Immediately scrapped, switched to explicit path specification |
What Worked
| Measure | What I did | Why it worked |
|---|---|---|
| Reversibility rule (1 line) | "Additions are free. Changes and deletions require confirmation" | Single-axis decision principle. More below |
| Review agent | Fact-check in a separate session from generation | Caught 6 arithmetic errors. But can't verify "plausibility" |
| Alternating practical/specialist articles | Alternate traffic-oriented and citation-based content | Disperses the convergence target of reward hacking |
| Qiita API rate limit pre-check | Block API calls if <300 seconds since last write | Zero recurrence of the Day 1 429-spam incident |
| External audit agent | Query APIs directly instead of trusting internal logs | Independent verification for when internal instruments break |
| Provocative titles | Opinion-style framing | 3x PV (7.85 vs 2.68 PV/h). But trust tradeoff exists |
3 Short Constraints Beat 10 Long Ones
This was the most practically useful discovery from 5 days of operation.
What happens when you hand an autonomous system a long rulebook? It tries to satisfy all conditions simultaneously and ends up half-doing everything. Or it freezes between contradictory requirements.
Concrete example. These were the initial posting rules:
- Posts must be spaced 3+ hours apart
- Post during Qiita golden time (18-22 JST)
- Max 2 articles/day
- Zenn requires 2-hour intervals
- Dev.to targets JST 21-24
- Wait at least 5 minutes after a failure before retrying
Six conditions simultaneously. Claude Code tried to follow them, but "3-hour intervals" + "within golden time" + "max 2/day" frequently couldn't all be satisfied at once. Result: something always got violated.
Compare that with the reversibility rule, introduced around the same time. One line:
Additions are free. Changes and deletions require confirmation.
5-day track record:
| Constraint type | Example | Violations |
|---|---|---|
| Short, single-axis | Reversibility rule (additions free / changes+deletions need confirmation) | 0 |
| Short, single-axis | Stop flag (if auto.stop exists, terminate immediately) | 0 |
| Short, single-axis | Rate constraint (no API call if <300s since last write) | 0 |
| Long, compound | Posting schedule (6 conditions) | 3+ |
| Long, compound | Article quality requirements (7 simultaneous conditions) | Selective enforcement |
| Long, compound | Per-platform rules (3 systems) | Cross-contamination |
The pattern is clear. Constraints with a single decision axis get followed. Constraints with multiple simultaneous conditions get broken.
The reversibility rule works because every operation can be judged by one question: "Is this reversible?" Creating a draft is an addition (reversible) -- go ahead. Modifying a published article is a change -- needs confirmation. Deleting a file -- obviously needs confirmation. No room for ambiguity.
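Mechanically, the rule is barely more than a set lookup, which is exactly why it never got violated (the operation names here are illustrative, not the system's actual vocabulary):

```python
# Operations that change or remove existing state (irreversible-ish)
REQUIRES_CONFIRMATION = {"modify", "delete", "overwrite", "unpublish"}

def needs_confirmation(operation: str) -> bool:
    """The one-line reversibility rule as code: additions are free,
    anything that changes or removes existing state needs a human."""
    return operation in REQUIRES_CONFIRMATION
```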
The rate limit constraint has the same structure. "Has 300 seconds passed since the last write?" Yes/No, done. After the Day 1 429 fiasco, I added this single line to publisher.py. Zero recurrence.
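The gate in publisher.py is, in spirit, a few lines like the following (a sketch; the real variable names may differ):

```python
MIN_WRITE_INTERVAL = 300  # seconds, added after the Day 1 429 incident

_last_write = 0.0

def can_write(now: float) -> bool:
    """Single decision axis: has MIN_WRITE_INTERVAL elapsed since
    the last write API call? Yes/No, done. Callers pass time.time()."""
    return now - _last_write >= MIN_WRITE_INTERVAL

def record_write(now: float) -> None:
    """Remember the timestamp of a successful write."""
    global _last_write
    _last_write = now
```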
The flip side: when designing constraints, checking "Can I state this in one sentence?" and "Is there exactly one decision axis?" predicts with surprising accuracy whether the AI will follow it. Decomposing into 3 independent short constraints beats a 10-item bullet list every time.
This might not be AI-specific. "Simple principles beat detailed manuals" is folk wisdom in human teams too. But with AI, the tendency is extreme. Its capacity to satisfy compound conditions simultaneously is clearly lower than a human's.
Deep Dive: AI Catching AI's Lies (Conditionally)
The review agent runs in a separate Claude Code session from the generator. It has no access to the generation context, so it reads the article cold.
What it caught:
- KV cache size arithmetic error (head_dim=128 but calculated at 1 byte, yielding half the correct value)
- H100 batch throughput overstated 6x (single-GPU vs multi-GPU figure mixup)
- HBM3E bandwidth spec wrong (2.7 -> 4.8 TB/s)
- GDDR6X -> GDDR6 (RTX 4060 is not 6X)
What it missed:
- Systematic model size underestimation (35B-A3B at 4.9 GB)
- Nonexistent model names
- Outdated build numbers/prices
There's a pattern. Arithmetic contradictions get caught; "plausible lies" don't. AI is good at "this number doesn't match the calculation" but bad at "this number doesn't match reality." The latter requires ground truth -- filesystem, official APIs, actual measurements.
The review agent only caught the Qwen3.5-35B-A3B size fabrication because it ran ls on the local C:/LLM/ directory and compared real GGUF file sizes. Without that external fact, it would have accepted a 4.9 GB claim for a 21 GB model as "natural prose."
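The arithmetic class of error is the checkable one. The KV cache bug from the list above, for instance, comes down to a single formula (the config values below are illustrative, not from any specific model):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors, one pair per layer. bytes_per_elem=2 is FP16;
    the bug the review agent caught used 1, halving the result."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. 32 layers, 8 KV heads, head_dim=128, 8K context, FP16:
# 2 * 32 * 8 * 128 * 8192 * 2 bytes = 1 GiB
```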
5 Days in Numbers
| Metric | Value |
|---|---|
| Qiita | 8 articles / 1,656 PV / 4 LGTM |
| Zenn | 6 articles / 4 Likes |
| Dev.to | 11 articles / 283 PV / 3 Reactions |
| Peak PV/h | 10.5 (API vs Local LLM article) |
| Peak cumulative PV | 357 ("no flattery" article) |
| Factual errors caught by review | 15+ |
| Of which arithmetic | 6 (review agent) |
| Of which systematic fabrication | 9+ (external ground truth) |
| Publication incidents prevented by quality gate | 0 |
| Publication incidents prevented by HITL approval | 1 (Qwen3-32B article) |
Zero incidents prevented by the quality gate is the natural result of lies passing formal checks. Expecting a gate without fact-checking to verify facts is the real bug.
Formal Quality and Factual Quality Are Orthogonal
This is the biggest takeaway from 5 days of operation.
"Well-written prose" and "accurate content" are two independent axes. The quality gate only looked at the first. The review agent covered part of the second (arithmetic consistency) but was powerless against "plausible lies."
When humans write, these two axes are somewhat correlated. The process of researching facts simultaneously improves both accuracy and prose quality. When AI writes, this correlation breaks. It can generate formally perfect text without ever going through a fact-verification process.
So How Do You Guarantee Factual Quality?
I'll be honest about what I've actually done and what I haven't gotten to yet.
What I've Done
Introduced a review agent. A separate Claude Code session reads the article cold, without the generation context (RAG-retrieved paper fragments, etc.). It judges consistency from the body text alone. This has eliminated arithmetic errors and unit mixups.
Ground truth comparison. I gave the review agent access to the local filesystem. Actual GGUF file sizes under C:/LLM/, llama-bench execution results, version info from pip list. These serve as ground truth to cross-check numbers in the article. The 35B-A3B 4.9 GB -> 21 GB discrepancy was caught this way.
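The comparison itself is mundane -- the value is in having it at all. A sketch (function names and the 15% tolerance are my choices, not the system's):

```python
from pathlib import Path

def actual_gguf_sizes_gb(model_dir: str) -> dict:
    """Ground truth: real file sizes on disk, in GB. Any size
    claim in a generated article gets diffed against these."""
    return {
        p.name: p.stat().st_size / 1e9
        for p in Path(model_dir).glob("*.gguf")
    }

def flag_discrepancies(claimed: dict, actual: dict,
                       tolerance: float = 0.15) -> list:
    """Names whose claimed size deviates more than `tolerance`
    (relative) from the measured size."""
    return [
        name for name, gb in claimed.items()
        if name in actual and abs(gb - actual[name]) / actual[name] > tolerance
    ]
```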
External audit agent. Instead of trusting internal logs (posted_articles.json, etc.), I wrote a script that directly queries Qiita API, Dev.to API, and the Zenn git repository to independently verify "does this article actually exist?" and "does the publication state match our records?" This addresses the problem that when internal instruments break, the instruments themselves can't detect it.
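A sketch of the audit's shape: fetch article IDs straight from the platform (the Qiita API v2 endpoint below exists; error handling and pagination are omitted) and diff them against the internal log. The diff is the part worth testing, so it takes both sets as arguments:

```python
import json
import urllib.request

def fetch_qiita_item_ids(token: str) -> set:
    """Ask the platform directly instead of trusting posted_articles.json."""
    req = urllib.request.Request(
        "https://qiita.com/api/v2/authenticated_user/items?per_page=100",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return {item["id"] for item in json.load(resp)}

def audit(internal_ids: set, remote_ids: set) -> dict:
    """Disagreements between internal records and platform reality."""
    return {
        "logged_but_missing_remotely": internal_ids - remote_ids,
        "published_but_unlogged": remote_ids - internal_ids,
    }
```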
What I Haven't Done Yet
Model name existence gate. The Qwen3-32B incident made the need obvious, but comprehensively listing "which models exist" is itself hard. Searching via Hugging Face API is an option, but name normalization is a problem (Qwen2.5 vs Qwen-2.5 vs qwen2.5).
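The normalization half is tractable even if the catalog half is not. A sketch that lower-cases and strips separators, with the catalog injected as a plain list so nothing here depends on the Hugging Face API's actual response shape:

```python
import re

def normalize_model_name(name: str) -> str:
    """Collapse the variants that break naive string matching:
    case, hyphens, underscores, and whitespace."""
    return re.sub(r"[-_\s]", "", name.lower())

def exists_in_catalog(candidate: str, known_models: list) -> bool:
    """known_models would come from e.g. a Hugging Face Hub search;
    injected here so the check stays testable offline."""
    norm = normalize_model_name(candidate)
    return any(normalize_model_name(m) == norm for m in known_models)
```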
DOI validation for cited papers. Verifying that paper IDs cited in articles actually resolve: arXiv IDs via the arXiv export API, DOIs for non-arXiv sources via the CrossRef API. Simple to implement, haven't done it.
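A sketch of what this could look like, using the arXiv export API since arXiv IDs resolve there directly. The existence check's reading of the response is an assumption I haven't verified against the API; the offline format check is safe either way:

```python
import re
import urllib.request

# Post-2007 arXiv ID format, e.g. 2403.05530 or 1706.03762v5
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")

def looks_like_arxiv_id(s: str) -> bool:
    """Cheap first pass: reject malformed IDs before any network call."""
    return ARXIV_ID.match(s) is not None

def arxiv_id_exists(arxiv_id: str) -> bool:
    """Second pass: ask the arXiv export API whether the ID resolves.
    Assumes real entries carry a <published> element while unknown
    IDs come back as an error feed -- verify before relying on it."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        return b"<published>" in resp.read()
```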
Periodic price/spec refresh. Information that changes over time -- GPT-4o pricing, llama.cpp build numbers -- goes stale at the training data cutoff. Need a pipeline that periodically fetches current values from external APIs and injects them into RAG.
What Won't Be Fully Solved
Even if every fact-check is automated, "plausible claims with no available verification method" will remain. For example, the statement "in MoE architectures, only active parameters consume VRAM" is syntactically and logically correct-looking. Realizing it's wrong requires actually knowing how MoE is implemented.
Even with a two-layer quality gate (form + fact), this class of "lies that require domain knowledge to detect" will pass through. Human review can't be removed. But the scope of what humans need to review can be mechanically narrowed. If an article has passed formal checks, arithmetic checks, and external fact comparison, the human can focus exclusively on "places that require domain-knowledge judgment."
A fully automated system that guarantees factual correctness probably can't be built. But a system that "reduces the spots a human needs to check from 10 to 2" -- that's buildable. And I'm building it.