The Question That Changed Everything
Sprint 10b was mechanically perfect. Eight stories, 41 points, 450 new tests, every ticket taken through a full TDD cycle. The burndown looked great. The test suite hit 5,575 passing tests across 334 files. We published documentation, created DR runbooks, and built load-testing frameworks.
Then the stakeholder asked: "When does someone actually use this thing?"
That question broke everything.
What 5,575 Tests Actually Prove
We audited every test file. Here's what we found:
NFR threshold tests — Pure functions that validate evaluateLatency(45, 50) returns PASS. They never measured an actual API call against the running server.
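The anti-pattern is easy to reconstruct in miniature (a sketch; the real `evaluateLatency` signature may differ). Because the "measurement" is a literal in the test file, the assertion can never fail for a reason the product cares about:

```typescript
type Verdict = "PASS" | "FAIL";

// Hypothetical reconstruction: compares a "measured" latency to a threshold.
// Nothing here ever touches the network or the running server.
function evaluateLatency(p95Ms: number, thresholdMs: number): Verdict {
  return p95Ms <= thresholdMs ? "PASS" : "FAIL";
}

// This "NFR test" passes forever, even if the real API takes 30 seconds,
// because 45 is hard-coded in the test, not measured from an API call.
console.assert(evaluateLatency(45, 50) === "PASS");
console.assert(evaluateLatency(60, 50) === "FAIL");
```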
Security tests — File scanners that grep source code for credential patterns. They never tested whether auth actually blocks unauthorized requests.
Load test framework — Functions that compute percentiles from synthetic arrays. They never sent an HTTP request.
UAT tests — File existence checks. They read Playwright spec files as strings and regex-match for patterns like testValidLogin. They never launched a browser. The Playwright specs themselves have never been executed.
Component tests — Read .tsx files as text and check that export function HealthPanel exists. They never rendered a component.
Backup tests — Create SQLite databases in temp directories, copy files, run PRAGMA integrity_check. They never backed up the production database.
Every single test validates our test infrastructure. Not one validates the product.
The Uncomfortable Math
- 17 UI tabs exist. 5 are functional. 12 are scaffolding.
- 144 TypeScript services exist. The number that have been called against the running server: unknown, but likely single digits.
- 102 MCP tools are registered. How many actually work when invoked: untested.
- 6 content channels were promised. 1 works (LinkedIn). The other 5 have code artifacts but zero end-to-end proof.
- 4 Docker sidecar services are defined. None have been started.
What We're Doing About It
We scrapped the original Sprint 11 plan (Reddit adapter, Newsletter adapter, Podcast E2E, Integration chain tests, Commercial packaging v2).
The new Sprint 11 has 10 stories, each ending with a proof-point a stakeholder can see:
- Platform Smoke Test — Open browser, click all 17 tabs, document what works vs crashes
- LinkedIn E2E — Create post in UI, watch scheduler publish it, confirm on LinkedIn
- Reddit Activation — Configure OAuth, submit a post, see it on reddit.com
- Newsletter E2E — Add subscriber, create campaign, verify data persists
- Content Sourcing — Register RSS feed, poll, see articles in the platform
- Audio/TTS — Send text, get audio file, play it
- Quality Gate — Submit content, get quality score, see pass/fail decision
- HITL Review — Approve/reject content, see decision stored in memory
- V3 Deployment — Run docker-compose.v3.yml, see all services healthy
- Gap Assessment — Honest inventory of every inception promise vs reality
Nothing was dropped. Reddit and Newsletter moved from standalone adapter stories to full E2E validation stories. Podcast is a stretch goal under Audio/TTS. The only deferral is Commercial Packaging v2 — because you cannot package what you haven't proven works.
The Pattern We Fell Into
Here's how it happened:
Sprint 10a established a test pattern for components: read the .tsx file as a string, regex-match for expected patterns. This was pragmatic — no jsdom dependency, fast execution, catches regressions. But it's a structural test, not a behavioral test. It proves the code exists. It doesn't prove it works.
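In miniature, the structural pattern looks like this (an illustrative sketch, not our actual test code). Note that it happily passes against a component that can never render:

```typescript
import { mkdtempSync, writeFileSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// A deliberately broken "component": it exports the expected name
// but throws the moment anyone tries to render it.
const dir = mkdtempSync(join(tmpdir(), "structural-"));
const file = join(dir, "HealthPanel.tsx");
writeFileSync(
  file,
  `export function HealthPanel() { throw new Error("not implemented"); }`
);

// The structural test: read the source as text, regex for the export.
// No renderer, no DOM, no behavior.
const source = readFileSync(file, "utf8");
const structurallyPasses = /export function HealthPanel/.test(source);

console.log(structurallyPasses); // true, even though rendering would crash
```

The test proves the symbol exists, which is exactly what the audit found: existence, not behavior.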
That pattern propagated. NFR tests became pure functions. Security tests became grep operations. Load tests became percentile calculators. Each test file was individually correct. Together, they created an illusion of coverage.
The TDD ceremony (RED-VERIFY-GREEN-REFACTOR-VALIDATE-DONE) reinforced it. Every ticket had six comments with evidence. Every test file had a VALIDATE comment with a DONE checklist. The methodology was followed perfectly. But the methodology's enforcement checks for keywords in comments; it doesn't check whether the test actually exercises the running system.
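Keyword enforcement, sketched in caricature (illustrative, not our actual tooling), shows the blind spot: any file containing the right ceremony markers "passes methodology", regardless of what the test exercises.

```typescript
// A test file that ticks every ceremony box while its "system under test"
// is a pure function defined three lines above the assertion.
const testFile = `
// RED: wrote failing test
// GREEN: made it pass
// VALIDATE: DONE
function add(a: number, b: number) { return a + b; }
console.assert(add(2, 2) === 4);
`;

// The enforcement check can only see keywords, so this file is "compliant".
const ceremonyComplete = ["RED", "GREEN", "VALIDATE", "DONE"].every((k) =>
  testFile.includes(k)
);

console.log(ceremonyComplete); // true: ceremony verified, product untouched
```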
What Other Teams Can Learn
Rule: A test suite that never hits the running system is a test suite for the test infrastructure.
If your TDD cycle writes a failing test, makes it pass, and the "system under test" is a pure function defined in the test file itself — you've tested your test, not your product.
The fix isn't to stop writing unit tests. The fix is to add a validation layer:
Unit tests (Layer 1) → prove functions return correct values
Integration tests (Layer 2) → prove services talk to each other
E2E tests (Layer 3) → prove the user can do what you promised
We had extensive Layer 1. Zero Layers 2 and 3.
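A minimal Layer-2 test can be sketched against a throwaway in-process server (the `/health` route, payload, and port handling here are assumptions for illustration, not our API). The key property: the assertion can only pass if a real request crosses a real socket.

```typescript
import { createServer } from "node:http";
import { once } from "node:events";

// Stand-in server; in a real Layer-2 test this would be the actual app.
const server = createServer((req, res) => {
  if (req.url === "/health") {
    res.writeHead(200, { "content-type": "application/json" });
    res.end(JSON.stringify({ status: "ok" }));
  } else {
    res.writeHead(404);
    res.end();
  }
});

// Listen on an ephemeral port, then exercise the endpoint over HTTP.
server.listen(0);
await once(server, "listening");
const { port } = server.address() as { port: number };

const res = await fetch(`http://127.0.0.1:${port}/health`);
const body = (await res.json()) as { status: string };

// Unlike a pure-function test, this fails if the server doesn't serve.
console.assert(res.status === 200 && body.status === "ok");
server.close();
```

Swap the stand-in server for the actual running app and the same shape becomes a genuine integration test.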
Sprint 11 is our Layer 2 and 3 sprint. By the end, every feature will either have stakeholder-visible proof or an honest "not working yet" status.
The Honest Numbers
After 10 sprints:
- 5,575 tests passing (all unit/structural)
- 0 integration tests against running server
- 0 E2E tests with real browser
- 1 of 6 channels validated in production
- 5 of 17 UI tabs confirmed functional
- 321 LinkedIn posts actually published (this works)
- ~$0 revenue from the platform itself
The LinkedIn scheduler works. It has been working since Sprint 2. Everything built on top of it in Sprints 3-10 needs to be proven.
That's what Sprint 11 is for.