Three days fighting the same problem. Three days on the same bugs. Fix one thing, break another. Fix that one, the first thing breaks again. The kind of loop where you start wondering if the code is haunted.
My second instinct (honestly, not the first) was to blame the model. "Claude Code is generating garbage, time to switch to Codex. The model is broken. Classic: as soon as there's traffic, they nerf it."
It wasn't broken. I made a rookie mistake. And I'm going to walk you through exactly which one, because if you're working with Claude Code on a real project, you're probably making it too.
TL;DR: Five independent systems were reading and writing the same field, each with its own regex, none knowing what the others were doing. No sequential integration test. Every fix was provisional by design. The model was fine. The architecture was the time bomb.
Commit 9. Same Bug. Different Fix.
The pipeline was straightforward on paper. An ecommerce client pushing product descriptions to four channels: WooCommerce (main site), Threads, a partner API, and a distributor CSV feed. One source, four outputs. A single product_content field held everything (images, internal links, HTML formatting, alt texts, source metadata).
Five separate systems touched that field:
- Image Handler: injects product image URLs, removes placeholders
- HTML Strip: cleans tags per channel (WooCommerce keeps HTML, Threads strips everything)
- URL Rewriting: rewrites internal /products/ links to absolute URLs per channel
- Alt Text Backfill: fills in missing alt text on all images
- Metadata Strip: removes source identifiers left over from the CMS export
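To make the shape of the problem concrete, here is a minimal sketch of what those transforms look like as functions. The names, regexes, and CDN/store URLs are my assumptions for illustration, not the actual project code; the point is that every transform is just `(string) => string` over the same blob.

```typescript
// Hypothetical shapes of the transforms: each one reads and writes the same
// product_content string with its own regex, unaware of the others.
type Channel = "woocommerce" | "threads" | "partner" | "csv";

type Transform = (content: string, channel: Channel) => string;

// Replaces CMS placeholders with product image URLs (assumed pattern).
const injectImages: Transform = (c) =>
  c.replace(/placeholder-(\w+)/g, "https://cdn.example.com/product-image-$1.jpg");

// Strips HTML tags for every channel except WooCommerce.
const stripHtml: Transform = (c, channel) =>
  channel === "woocommerce" ? c : c.replace(/<[^>]+>/g, "");

// ...urlRewrite, altTextBackfill, and metadataStrip follow the same shape.
```

Each function is individually trivial to unit-test, which is exactly why the unit tests stayed green while the pipeline broke.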
Each transformation was a function. Each function had tests. Each function passed.
Commit 1: image fix. Tests green. Ship.
Commit 3: HTML stripping broke. Fixed. Tests green. Ship.
Commit 5: URLs not rewriting. Fixed. Tests green. Ship.
Commit 7: images duplicated again.
Commit 9: same spot. Different error message. Same root cause I still hadn't named.
Nine commits later, the bug was back. Exactly where it started.
My First Instinct Was Wrong. So Was My Second.
I spent four hours gaslighting a language model. It didn't complain. That made it worse.
First instinct: reread my own code. I went through every function twice. Everything looked correct. Each transformation did exactly what it claimed to do. Clean logic. Readable. No obvious side effects.
So I moved to instinct two. You know where this is going.
"The model is generating code that eats itself. The context window is polluted. It's forgetting what it fixed three sessions ago."
I started reformulating the bug with more emphasis in my prompts. Pasting more context. At some point I switched to a different model to see if the output changed. Spent about four hours in this loop.
Cool.
The humiliating part: the model was building exactly what I asked for. The problem wasn't how it wrote the code. The problem was what I asked it to maintain: five separate functions, each written in isolation, each tested in isolation, each correct in isolation. Nobody asked whether they'd interfere when run in sequence on the same string.
I hadn't asked. So Claude hadn't tested for it. That's not a model failure. That's a spec failure.
Blaming the model isn't irrational, by the way. LLMs do produce specific categories of systemic bugs: stale context across sessions, inconsistent naming conventions, duplicated logic that diverges silently. Those are real patterns. This wasn't one of them. The model was delivering exactly what I specified. What I specified was incomplete.
I switched models to check. The bug was still there. Obviously.
Five Systems. One Blob. Zero Contracts.
Shared mutable state without contracts is a time bomb. Not sometimes. Not in edge cases. Always.
It's page one of every distributed systems course. Why we have event sourcing, message queues, explicit state machines. The moment you have multiple independent processes writing to the same piece of data, you have a coordination problem. And a coordination problem without tests is just a bug waiting for the right sequence of operations to surface.
My pipeline was exactly that.
Handle images. Clean HTML. Rewrite URLs. Backfill alt text. Strip metadata. Five independent regex operations on one field, in that order (or close to it, depending on which function ran when). When I fixed the image handler regex in commit 7, it changed what string the URL rewriter received. The URL rewriter had no idea. It had passed its own tests.
No test ever ran all five transformations on the same input, in sequence.
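A stripped-down, contrived version of this collision fits in a dozen lines. Neither transform below is wrong in isolation; the output depends entirely on which one runs first:

```typescript
// Two order-dependent regex transforms (toy versions, not the project's code).
const injectImage = (s: string) =>
  s.replace(/\[IMG\]/g, "![](/products/widget.jpg)");

const rewriteUrls = (s: string) =>
  s.replace(/\/products\//g, "https://store.example.com/products/");

const src = "[IMG] See /products/widget";

// Inject first: the injected image path gets rewritten too.
const a = rewriteUrls(injectImage(src));
// Rewrite first: the image injected afterwards keeps its relative path.
const b = injectImage(rewriteUrls(src));

console.log(a === b); // false - the output depends on execution order
```

Unit-test each function alone and both pass forever. Only a test that runs them in production order can see the difference between `a` and `b`.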
That's the rookie mistake. Not a clever architectural failure. Not a subtle LLM-specific bug. Shared mutable state with no sequential integration test. The first thing they warn you about in every intro to distributed systems.
The first thing you forget when a client is waiting on a delivery.
The 4-Symptom Checklist: Is Your Pipeline a Ticking Bomb?
Five minutes to know if you're in the same situation.
1. A bug you fixed comes back in a different commit.
Not the exact same error, but the same behavior. You fixed it "correctly" and something downstream is silently undoing it.
2. Multiple functions or systems read and write the same field or file.
Doesn't matter if they're well-structured. Doesn't matter if they're pure functions. If they share input and output without a defined execution contract, they can collide.
3. Your tests cover each transformation in isolation but never in sequence.
Unit tests pass. Pipeline breaks. That gap is exactly the problem.
4. You've had to read another system's code to understand why yours was failing.
If the only way to debug function A is to understand what function C did to the same string three steps earlier, you don't have a bug. You have undeclared coupling.
Check 3 out of 4 and you don't have a bug. You have an architecture with no safety net.
One note: this diagnostic isn't specific to AI projects. Any data transformation pipeline can land here. But Claude Code projects are particularly exposed because the development rhythm is faster. Features ship before the tests that would have caught the architecture problem. The tests come later (if they come at all).
The Fix: 42 Tests, 3 Real Bugs, One Permanent Safety Net
I didn't architect my way out of this. I extracted the functions and wrote the tests I should have written at commit 1.
I moved all five transformation functions out of the React components and database mutations into a standalone contentHelpers.ts. Pure functions. Zero external dependencies. Something you can actually test without spinning up a component tree.
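A minimal sketch of what that extracted module might look like, with the execution order made explicit. The step implementations and names here (`runFullPipeline`, the regexes, the example URLs) are stand-ins I'm assuming for illustration, not the project's actual code:

```typescript
// contentHelpers.ts (sketch): pure steps composed in one declared, fixed order.
interface PipelineOptions {
  channel: "woocommerce" | "threads" | "partner" | "csv";
  productSlug: string;
}

type Step = (content: string, opts: PipelineOptions) => string;

// Trivial stand-in implementations; the point is the explicit ordering.
const injectImages: Step = (c) =>
  c.replace(/placeholder-(\w+)/g, "![](https://cdn.example.com/product-image-$1.jpg)");
const stripHtmlForChannel: Step = (c, o) =>
  o.channel === "woocommerce" ? c : c.replace(/<[^>]+>/g, "");
const rewriteInternalUrls: Step = (c) =>
  c.replace(/\(\/products\//g, "(https://store.example.com/products/");
const backfillAltText: Step = (c, o) => c.replace(/!\[\]\(/g, `![${o.productSlug}](`);
const stripSourceMetadata: Step = (c) => c.replace(/<!-- cms:[^>]* -->/g, "");

const steps: Step[] = [
  injectImages,
  stripHtmlForChannel,
  rewriteInternalUrls,
  backfillAltText,
  stripSourceMetadata,
];

function runFullPipeline(input: string, opts: PipelineOptions): string {
  // The order is now a declared contract, not an accident of call sites.
  return steps.reduce((content, step) => step(content, opts), input);
}
```

With the order pinned in one place, a sequential test exercises the exact path production takes.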
Then I wrote 42 tests covering all five systems. The key test was F5:
// contentHelpers.test.ts - F5: Sequential pipeline integration
describe("F5 - Full pipeline (sequential)", () => {
  it("runs all five transforms without conflict", () => {
    const input = `
![](placeholder-hero)

Check out our [related item](/products/widget?ref=upsell) for more details.
`.trim();
    const result = runFullPipeline(input, {
      channel: "woocommerce",
      productSlug: "widget",
    });
    // Image processed exactly once - not doubled
    expect((result.match(/product-image/g) || []).length).toBe(1);
    // Internal URLs rewritten, including query params
    expect(result).not.toContain("](/products/");
    expect(result).toContain("https://store.example.com/products/widget");
    // Alt text backfilled on all images
    expect(result).not.toMatch(/!\[\]\(/);
    // Source metadata removed
    expect(result).not.toContain("placeholder-");
  });
});
Three real bugs surfaced the moment F5 ran for the first time. Each one had been silently shipping. Each one was invisible to the unit tests because none of them exercised the transforms in sequence on the same input. (A proper e2e test would have caught it too, but that means spinning up WooCommerce, hitting the partner API, and generating a real CSV. F5 caught it in 200ms with zero infrastructure. For a transformation pipeline, that's the right level.)
+286 lines net. 261 of them tests.
The rule now: no new feature that touches product_content ships without a sequential integration test. Not as policy. As a merge condition.
324 lines added. 3 bugs fixed. The next feature has to pass all 42 tests before it merges.
Tests Aren't Best Practice. They're a Contract Clause.
At a sustained Claude Code rhythm (I average around 14 sessions a day on active projects, sometimes more), each session can silently undo the work of the previous one. Not because the model is inconsistent. Because "done" has no formal definition.
Without sequential integration tests, "done" means "it works in my head." With them, "done" means "it passes 42 assertions on the full pipeline." Those aren't the same thing. The second one is actually checkable.
The distinction that matters: unit test vs. sequential integration test. A unit test tells you that a transformation does what it claims in isolation. A sequential integration test tells you that five transformations don't eat each other when they run on the same input in production order. The first gives you the illusion of coverage. The second is the only test that mirrors your actual production scenario.
F5 was that test. And it's the test that had never existed.
This is now a hard clause in how I structure Prompt Contracts before Claude Code starts a session: for any field or file touched by more than one system, the acceptance criteria explicitly include a sequential integration test. Not as a delivery condition. As a start condition. Before the first line of code gets written, the test structure is part of the brief.
Claude Code won't write that test spontaneously if you don't specify it. And if you don't specify it, you're paying for code that works in isolation and hoping the seams hold in prod.
They don't hold. I have nine commits of evidence.
Without a sequential test, every Claude Code session can undo the previous one. That's not a bug. That's the default design.
The industry will keep arguing whether Claude Code beats Codex, whether models "actually understand" architecture. Whole conference tracks dedicated to that question.
Meanwhile, the devs who build will write the sequential integration test before the feature. Not as best practice. As the only way to make "done" mean something in a shared pipeline. Clean commit. Green tests. Feature merged.
Three days to learn what's on page one of every distributed systems course. Mea culpa, Claudius. 🫡
If what I build with Claude Code interests you, subscribing means you get the next mistake as soon as I've figured out what it cost me.
(*) The cover is AI-generated. It did not require three days of debugging to produce.