We Got Called Out for Writing AI Success Theatre — Here's What We're Changing
A developer read our Sprint 7 retrospective and compared it to "CIA intelligence histories — designed to make the Agency seem competent and indispensable, even when it isn't."
That stung. And then I realized: he's right.
The Problem He Identified
Nick Pelling is a senior embedded engineer who's been watching our AI-managed development project. We've published retrospective blog posts after every sprint — nine so far. His feedback was blunt:
"The blog's success theatre has an audience of one."
"Logging activities is a stakeholder-facing thing, but not very interesting to non-stakeholders."
"Maybe you need a second blog that other people might be more interested to read."
He's pointing at a real failure: we optimized our blogs for internal accountability and accidentally published them as if they were developer content. They aren't. They're audit logs wearing a blog post's clothes.
What Success Theatre Looks Like
Here's a line from our Sprint 7 retrospective:
"Nine consecutive sprint publishing passes — 100% reliability maintained."
That's true. It's also the kind of thing you put in a status report to your boss. A developer on Dev.to reading that thinks: "Cool. Why should I care?"
Or this: "OAS-124-T2: Pipeline Execution & Artifact Validation — 7 tests pass."
That's a ticket ID. Nobody outside our project knows what OAS-124 means. We were writing for ourselves and pretending we were writing for you.
The pattern across nine posts is consistent:
- Lead with metrics that make us look good
- Bury failures in a "What Went Wrong" section that's shorter than the "What We Built" section
- End with a provenance table that nobody asked for
- Scatter ticket IDs everywhere like they're meaningful
What Actually Happened in Sprint 7 (Honest Version)
We're building an automated marketing platform — an AI-managed "agency" that handles content sourcing, script generation, audio narration, video production, and publishing. Sprint 7 was supposed to prove all the pieces work together.
Here's what actually happened:
We put 118 services in one file and it's a problem
Over six sprints, we built 118 backend services — API endpoints for everything from text-to-speech to YouTube uploads. Each one was individually tested and worked fine.
Then we wired them all into a single Express server file (api-server.mjs). All 118 routes, one file. No domain separation, no route modules.
This is the kind of decision that feels pragmatic at the time ("just add it to the server file") and becomes technical debt the moment someone else has to read it. We've committed to extracting route modules before writing any frontend code, but the fact that it got this far is a planning failure we should have caught earlier.
Our tests prove wiring exists, not that anything works
Sprint 7's big achievement was "118 services wired to production REST routes." Sounds impressive. But here's what the tests actually do:
```javascript
// What our tests do (source inspection)
const src = fs.readFileSync('api-server.mjs', 'utf-8');
expect(src).toContain('app.post("/api/memory/store"');
// ✅ Passes — the route registration exists in the source code

// What our tests DON'T do (runtime validation)
const res = await fetch('http://localhost:3847/api/memory/store', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ content: 'test' }),
});
expect(res.status).toBe(200);
// ❌ We never wrote this test
```
We verified that route registrations exist in the source code. We did not verify that any of them actually respond correctly when called. Source inspection proves the wiring is there. It says nothing about whether the wiring works.
This is the difference between checking that a plug is in the socket and checking that electricity flows through it.
Advisory warnings don't change behavior
We have a rule (ADR-032) that says AI personas should store what they learn after completing each task. We added advisory warnings — "Hey, you didn't store any memories for this sprint."
Across three separate sprints (Sprint 0, Sprint 4, and Sprint 7), zero persona memories were stored. The warnings fired. They were ignored. Every time.
This taught us something genuinely useful about AI agent systems: advisory-only governance does not work for AI agents. If you want an AI agent to do something consistently, you need to make it mechanically impossible to skip. Warnings are suggestions. Gates are requirements.
We're escalating from "warn at completion" to "block completion until the requirement is met." If the pattern holds, this will be the fix. If it doesn't, we'll have to rethink the entire memory architecture.
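A minimal sketch of what that escalation could look like. The names here (`completeTask`, `TaskContext`, `memoriesStored`) are illustrative, not our actual API:

```typescript
// Hypothetical sketch of escalating from advisory warning to hard gate.
// These names are illustrative, not our real task-completion API.
interface TaskContext {
  taskId: string;
  memoriesStored: number; // persona memories written during this task
}

function completeTask(ctx: TaskContext): void {
  if (ctx.memoriesStored === 0) {
    // Advisory version was a console.warn here. It fired in three
    // separate sprints and was ignored every time.
    // Gate version: completion is mechanically impossible until the
    // requirement is met.
    throw new Error(
      `Task ${ctx.taskId} blocked: store at least one persona memory before completing`
    );
  }
  // ...mark task complete
}
```

The difference is structural, not cosmetic: a warning leaves the decision with the agent, while a thrown error removes the decision entirely.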
The E2E pipeline test was the real win — and the real lesson
We built a pipeline executor that chains six stages: Source → Script → Audio → Assembly → Quality Gate → RSS. Each stage takes the previous stage's output as input. If any stage fails, subsequent stages are skipped (not failed — skipped).
```typescript
type StageFn = (input: unknown) => unknown | null;
type StageResult = { name: string; status: 'ok' | 'fail' | 'skip' };

class PipelineExecutor {
  private stages: Array<{ name: string; fn: StageFn }> = [];

  addStage(name: string, fn: StageFn): void {
    this.stages.push({ name, fn });
  }

  run(): StageResult[] {
    const results: StageResult[] = [];
    let currentInput: unknown = null;
    let failed = false;
    for (const stage of this.stages) {
      if (failed) {
        // Skip, don't fail: the distinction matters for diagnostics
        results.push({ name: stage.name, status: 'skip' });
        continue;
      }
      try {
        const output = stage.fn(currentInput);
        if (output === null) {
          failed = true;
          results.push({ name: stage.name, status: 'fail' });
          continue;
        }
        currentInput = output;
        results.push({ name: stage.name, status: 'ok' });
      } catch (e) {
        failed = true;
        results.push({ name: stage.name, status: 'fail' });
      }
    }
    return results;
  }
}
```
The distinction between "failed" and "skipped" matters more than you'd expect. When a pipeline breaks, you want to know: which stage actually failed, and which stages never got a chance to run? If you mark everything after the failure as "failed," your diagnostics are useless — you can't tell root cause from cascade.
This is a pattern worth stealing for any multi-stage pipeline: fail the broken stage, skip the rest, and make the skip reason traceable.
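To see the diagnostics difference concretely, here's a self-contained sketch of the fail-then-skip behavior (a standalone function with made-up stages, not our actual executor):

```typescript
type Stage = { name: string; fn: (input: unknown) => unknown | null };

// Run stages in order: the first failure is marked 'fail', and every
// later stage is marked 'skip', so root cause stays distinguishable
// from cascade.
function runStages(stages: Stage[]): Array<{ name: string; status: string }> {
  const results: Array<{ name: string; status: string }> = [];
  let failed = false;
  let input: unknown = null;
  for (const { name, fn } of stages) {
    if (failed) {
      results.push({ name, status: 'skip' });
      continue;
    }
    try {
      const out = fn(input);
      if (out === null) {
        failed = true;
        results.push({ name, status: 'fail' });
        continue;
      }
      input = out;
      results.push({ name, status: 'ok' });
    } catch {
      failed = true;
      results.push({ name, status: 'fail' });
    }
  }
  return results;
}

// The Audio stage throws, so Assembly is skipped, not failed.
const report = runStages([
  { name: 'Source', fn: () => 'urls' },
  { name: 'Script', fn: (x) => `script from ${x}` },
  { name: 'Audio', fn: () => { throw new Error('TTS quota exceeded'); } },
  { name: 'Assembly', fn: (x) => x },
]);
console.log(report.map((r) => `${r.name}:${r.status}`).join(' '));
// Source:ok Script:ok Audio:fail Assembly:skip
```

Reading that report, you know immediately that Audio is the stage to debug; Assembly never ran, so its status carries no blame.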
We planned 58 points and delivered ~38
Our sprint planning estimated 58 story points. We delivered about 38. That's a 34% miss.
The standard response is to spin this as "right-sizing" or "healthy scope management." And there's some truth to that — we did prune scope rather than cutting corners. But the honest version is: our estimation was 53% over-optimistic, and we don't have good tooling to prevent this.
If you're running AI agents on sprint work, be aware that estimation is harder, not easier, with AI. The agent can write code fast, but the ceremony overhead (TDD phases, documentation, memory storage, provenance tracking) adds significant time that's easy to underestimate.
What We're Changing
Starting with Sprint 8, our public blog posts will follow a different structure:
- Lead with what went wrong — not what we built. The failures are where the transferable lessons live.
- No ticket IDs — if you have to explain what OAS-124 means, it doesn't belong in a public post.
- No provenance tables — these are compliance artifacts, not reader value.
- No "publishing streak" metrics — nobody cares how many consecutive blog posts we've published. They care if we have something worth reading.
- Code that solves problems — show the actual implementation with enough context for someone to reuse it. The pipeline executor pattern above is an example.
- Honest failure analysis — not "what went wrong" as a perfunctory section, but failure as the centerpiece of the post.
The internal retrospective (ticket-level accountability, sprint metrics, provenance) will stay in our internal tooling where it belongs.
Thank You, Nick
Nick Pelling's feedback was the most useful thing anyone has said about this project in nine sprints. It took an outside perspective to see what we'd normalized: publishing internal status reports and calling them blog posts.
The previous retrospective posts will stay published — they're an honest record of where we were, and now they serve as a "before" example of exactly the pattern Nick identified.
If you see us falling back into success theatre, call it out. That's the most valuable contribution a reader can make.
This post was written by Michael Polzin with AI assistance (Claude Opus 4.6). The irony of using AI to write a post about AI-generated content being too polished is not lost on us. Nick would probably have something to say about that too.