We Audited Our AI-Managed Project at 46% Complete. Here's What We Found.
At 46.4% completion — 144 tickets done, 5 epics shipped, 2,633 tests passing — we stopped building and ran a full audit. Not because something broke. Because the methodology says you should, and because we wanted to know the truth before committing to 6 more sprints.
This is the unfiltered result.
The Setup
The project is a content production platform: LinkedIn scheduling, audio narration, YouTube pipelines, podcast assembly, and a Media Orchestration Engine (MOE) that coordinates it all. Think "25-person agency" run by one human and a team of AI agents.
The stack: TypeScript, SQLite (WAL mode), Express, React, Vitest, Docker. Managed through the ORCHESTRATE Agile methodology — an MCP server that mechanically enforces lifecycle modes, story formats, TDD phases, and quality gates.
Every ticket goes through Documentation-Driven TDD: write docs first, bind them to the ticket, write failing tests, implement, refactor, validate. Every phase transition requires logged evidence with minimum character counts and keyword checks. The server blocks you if you skip steps.
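The evidence gate described above can be sketched roughly as follows. All names here (`PhaseEvidence`, `validateEvidence`, `MIN_CHARS`) are illustrative, not the actual ORCHESTRATE MCP API, which the post doesn't show:

```typescript
// Hypothetical sketch of an evidence gate like the one described above.
type Phase = "docs" | "red" | "green" | "refactor" | "validate";

interface PhaseEvidence {
  phase: Phase;
  log: string;        // free-text evidence the agent must submit
  keywords: string[]; // keywords the log must contain for this phase
}

const MIN_CHARS = 200; // assumed minimum evidence length

function validateEvidence(ev: PhaseEvidence): { ok: boolean; reason?: string } {
  if (ev.log.trim().length < MIN_CHARS) {
    return { ok: false, reason: `evidence under ${MIN_CHARS} chars` };
  }
  const missing = ev.keywords.filter((k) => !ev.log.includes(k));
  if (missing.length > 0) {
    return { ok: false, reason: `missing keywords: ${missing.join(", ")}` };
  }
  return { ok: true }; // transition to the next phase is unblocked
}
```

The point is mechanical enforcement: the transition is refused unless the submitted evidence clears both checks.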
At the time of audit:
- 11 epics planned, 5 complete
- 84 stories, 39 done
- 266 tickets, 144 done
- 12 sprints, 6 closed
- 63,227 lines of code and tests
- Test-to-source ratio: 1.75:1
What We Audited
Six analysis streams, run in parallel:
- Sprint 4–8 stories and tickets (122 remaining items) — format, acceptance criteria, DONE criteria, sizing
- All source code (87 service files) — Result pattern adoption, test coverage, tech debt
- Feature-to-stakeholder crosswalk — 101 features traced through journeys, epics, and stories
- RAID log — 20 entries checked for resolution status
- ADRs and specs — 15 architecture decisions, 27 specifications
- User journeys — 17 journeys checked for story coverage gaps
The Good
The architecture holds up. Five epics delivered cleanly: Infrastructure, Memory System, Content Sourcing & Provenance, Platform Standards, and Production Validation. These form the foundation that everything else builds on, and they're solid.
Test discipline is real. 2,633 tests with a 1.75:1 test-to-source ratio. The DD TDD workflow — write docs, bind them, write failing tests, implement, refactor, validate — produces verifiable artifacts at every phase. This isn't "we have tests." This is "every behavior is documented before it's tested before it's built."
Sprint 3 Audio Engine delivered well. 154 tests across the audio job queue, GPU concurrency with CPU fallback, voice profiles, and a human review escalation queue. The injectable dependency pattern (JobStoreLike, GpuSchedulerLike) makes everything testable without real hardware.
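The injectable pattern looks roughly like this sketch. Only the interface names `JobStoreLike` and `GpuSchedulerLike` appear in the post; their members and the runner class are assumptions:

```typescript
// Sketch of the *Like dependency-injection pattern described above.
interface JobStoreLike {
  enqueue(jobId: string): void;
  nextPending(): string | undefined;
}

interface GpuSchedulerLike {
  gpuAvailable(): boolean;
}

class AudioJobRunner {
  constructor(
    private store: JobStoreLike,
    private gpu: GpuSchedulerLike,
  ) {}

  // Picks the next job and decides the execution lane without touching
  // real hardware — both collaborators are injected and fakeable in tests.
  runNext(): { jobId: string; lane: "gpu" | "cpu" } | undefined {
    const jobId = this.store.nextPending();
    if (jobId === undefined) return undefined;
    return { jobId, lane: this.gpu.gpuAvailable() ? "gpu" : "cpu" };
  }
}
```

In a Vitest test, a fake `GpuSchedulerLike` that always returns `false` exercises the CPU fallback path with no GPU present.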
The planned sprints (4–8) are mostly excellent. The 6 core MOE stories, 4 YouTube stories, 3 podcast stories, and 3 quality gates stories all have proper user story format, thorough Given/When/Then acceptance criteria, well-decomposed tickets with DONE criteria, and documented edge cases. When the planning was done deliberately, it produced high-quality artifacts.
The Bad
10 Stories Are Empty Shells
This is the biggest finding. When Sprint 3's retrospective produced 6 decisions (D1–D6), we promoted them to backlog items, groomed them with acceptance criteria, and promoted them to stories in their target sprints. The promotion worked — stories were created. But:
- None have user story format ("As a [role], I need...")
- None have Given/When/Then acceptance criteria in the story description
- None have any tickets
The backlog grooming added acceptance criteria to the backlog item, but backlog_manage(action='promote') didn't carry that content into the story description. We created 10 stories that look ready but will be blocked the moment anyone tries to execute them.
That's 48 story points and 30+ missing tickets across Sprints 4, 6, 7, and 8.
Lesson: Automation that looks like it worked is more dangerous than automation that fails loudly. The MCP correctly blocks malformed stories at execution time, but the promotion step should have validated format on creation, not deferred it.
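One fix is to validate story format at promotion time instead of execution time. A minimal sketch, with the caveat that the `backlog_manage` internals and field names are assumptions:

```typescript
// Sketch: validate user-story format when a backlog item is promoted,
// instead of deferring the check to execution time.
interface BacklogItem {
  title: string;
  description: string;          // should contain "As a [role], I need..."
  acceptanceCriteria: string[]; // should be Given/When/Then
}

const STORY_FORMAT = /^As an? .+, I (need|want)/i;
const GWT = /given .+ when .+ then/is;

function promoteToStory(item: BacklogItem): { ok: true } | { ok: false; errors: string[] } {
  const errors: string[] = [];
  if (!STORY_FORMAT.test(item.description)) {
    errors.push("description lacks user-story format");
  }
  if (!item.acceptanceCriteria.some((ac) => GWT.test(ac))) {
    errors.push("no Given/When/Then acceptance criteria carried over");
  }
  if (errors.length > 0) return { ok: false, errors }; // fail loudly at creation
  // ...create the story, copying description and AC from the backlog item...
  return { ok: true };
}
```

With a gate like this, the 10 shell stories would have failed at promotion, not weeks later at execution.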
4 Failing Tests Nobody Noticed
Out of 2,637 individual tests, 4 are failing across 3 files:
- `initial-migration.test.ts` — expects 1 migration, finds 2
- `nli-priority.test.ts` — priority sort is non-deterministic
- `provenance-schema.test.ts` — missing UNIQUE constraint
These have been failing silently. They didn't block delivery because they're in completed sprints, but they represent data integrity assumptions that are wrong. The migration count mismatch means a new migration was added without updating the test. The provenance schema gap means duplicate (content_id, position_index) entries are possible.
Lesson: A green test suite is a lie if you don't look at the actual numbers. "141 passed, 3 failed" should never be an acceptable state.
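The provenance fix itself is small. SQLite can't add a table-level UNIQUE constraint via `ALTER TABLE`, so a unique index enforces the same invariant. The column pair comes from the audit; the table and index names below are assumptions:

```typescript
// Sketch of the migration that closes the provenance gap. A unique index
// makes duplicate (content_id, position_index) rows impossible going forward;
// existing duplicates must be cleaned up first or the index creation fails.
function provenanceUniqueIndexSql(): string {
  return [
    "CREATE UNIQUE INDEX IF NOT EXISTS idx_provenance_content_position",
    "ON provenance (content_id, position_index);",
  ].join("\n");
}
```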
Result Pattern Adoption Stalled at 49%
ADR-028 mandated that all public service methods return Result<T,E> instead of throwing or returning null. 43 of 87 services comply. 44 don't — including core data path services like queue.ts, database-manager.ts, and scheduler.ts.
The V3 services all follow the pattern. The V2 services don't. There's no migration plan, no tracking, and no timeline for bringing the remaining 44 services into compliance.
Lesson: An ADR without enforcement is a suggestion. The methodology enforces story format and TDD phases but has no gate for "does this service match the agreed architectural pattern?"
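For readers unfamiliar with the pattern: `Result<T,E>` replaces throw/null with an explicit union the caller must branch on. A minimal sketch in the spirit of ADR-028 — the repo's actual type and helper names may differ:

```typescript
// Minimal Result<T, E>: success and failure are both ordinary return values.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// A compliant service method returns Result instead of throwing or
// returning null on the failure path. (Illustrative example only.)
function findVoiceProfile(id: string): Result<{ id: string; name: string }, string> {
  const profiles: Record<string, string> = { v1: "Narrator" };
  const name = profiles[id];
  if (name === undefined) return err(`no voice profile: ${id}`);
  return ok({ id, name });
}
```

Callers must check `ok` before touching `value`, so every failure path is visible at the call site — which is exactly the property the ADR is after, and exactly what the 44 non-compliant services lack.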
7 Tickets Are Oversized
Several tickets pack multiple distinct concerns:
- OAS-061-T1 (FFmpeg) combines composition, Ken Burns animation, text overlays, waveform visualization, AND GPU encoding — five separate FFmpeg filter chain concerns in one ticket
- OAS-070-T1/T2 (UI tabs) each contain 4–5 sub-features that should be individual tickets
- OAS-057-T3 bundles derivative content generation with a V2 API compatibility bridge
These will exceed the 150K token ceiling and force messy mid-ticket splits during delivery.
Sprint 4 Is Overloaded
Sprint 4 (Media Orchestration Engine) has 12 stories totaling 70 points — the largest sprint by far. The core MOE alone is 50 points across 6 stories. Add 4 retro-promoted stories (12 points) and 2 ceremony stories (8 points), and you have a sprint that's 75–90% over the target capacity.
This was a known risk (RAID entry c0bf41df), but no mitigation was applied.
The Ugly (Metadata Rot)
This one stings because the methodology explicitly tracks all of this:
- 101 features — every single one still shows "IDENTIFIED" status, despite 5 complete epics. Zero features have been updated to "DELIVERED."
- 20 RAID entries — all 20 are "OPEN," including 8 that are demonstrably resolved (dependencies completed, risks mitigated, assumptions validated)
- 5 ADRs — remain "PROPOSED" despite their decisions being fully implemented in shipped code
- 27 specs — all "DRAFT," including specs for completed work
- 4 functional specs have cross-wired feature IDs (FS-02 links to Audio instead of Quality Gates, etc.)
- 15 NFR specs have no feature linkage at all
The work was done. The artifacts were created. But nobody went back to update the status fields. The metadata that's supposed to prove traceability is stale.
Lesson: Completion isn't just "the code works." If your methodology generates artifacts, maintaining those artifacts IS part of done. We had a Definition of Done for tickets but not for the project's own metadata.
12 Services Have Zero Tests
Not "insufficient tests." Zero.
| Service | Lines | Risk |
|---|---|---|
| `linkedin-api.ts` | 312 | Core integration — untested |
| `newsletter-api.ts` | 398 | Hardcoded localhost on line 361 |
| `kdp-sales.ts` | 349 | Revenue feature — untested |
| `audience-segmentation.ts` | 344 | Analytics — untested |
These are V2 services that were carried forward. The 1.75:1 test ratio is real for V3 code. For V2 code, it's 0:1 in some cases.
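The hardcoded-localhost finding suggests the first step for these services is the same injection pattern the audio engine already uses. A sketch — the `NewsletterApi` shape here is invented, since the post only tells us a localhost URL is hardcoded on line 361:

```typescript
// Sketch: lift the hardcoded base URL into injected config so the V2
// service becomes testable without a live endpoint.
interface NewsletterConfig {
  baseUrl: string; // previously hardcoded to a localhost URL in the V2 code
}

class NewsletterApi {
  constructor(private config: NewsletterConfig) {}

  // URL construction now depends only on injected config, so a test can
  // point it anywhere.
  subscribeUrl(listId: string): string {
    return `${this.config.baseUrl}/lists/${encodeURIComponent(listId)}/subscribe`;
  }
}
```

Once the environment is injectable, characterization tests can be added incrementally without standing up real infrastructure.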
The Remediation Plan
We categorized everything into blocking (must fix before Sprint 4), structural (should fix for right-sizing), and hygiene (should fix for traceability):
Blocking (before Sprint 4):
- Rewrite 10 retro-promoted stories with proper format and AC
- Create 30+ tickets for the empty stories
- Fix the 4 failing tests across 3 files
- Split Sprint 4 into 4a/4b or defer retro stories
Structural (during Sprint 4):
- Split 7 oversized tickets into right-sized units
- Split 2 INVEST-violating stories (OAS-069, OAS-070)
- Add missing WebSocket infrastructure ticket
Hygiene:
- Bulk update ~40 features to DELIVERED
- Close 8 resolved RAID entries
- Accept 5 implemented ADRs
- Update 12+ specs from DRAFT to APPROVED
- Fix 4 cross-wired feature IDs
- Link 15 NFRs to their features
Total estimated effort: ~8 hours of tool calls and test fixes.
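Most of the hygiene pass is mechanical. A hypothetical sketch of the feature bulk-update — the data shapes are invented for illustration, and the real work would go through the MCP tool surface rather than in-memory objects:

```typescript
// Hypothetical sketch of the bulk hygiene pass for feature statuses.
interface Feature {
  id: string;
  epicId: string;
  status: "IDENTIFIED" | "DELIVERED";
}

function markDeliveredFeatures(features: Feature[], doneEpics: Set<string>): Feature[] {
  // Flip only features whose owning epic actually shipped — the audit
  // found roughly 40 of 101 in this state.
  return features.map((f) =>
    doneEpics.has(f.epicId) && f.status === "IDENTIFIED"
      ? { ...f, status: "DELIVERED" as const }
      : f,
  );
}
```

The same loop shape applies to closing resolved RAID entries and accepting implemented ADRs: derive the correct status from shipped work, never hand-edit one record at a time.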
What This Means
The project's foundations are strong. The architecture is sound, the test discipline is real, the planned work is mostly well-decomposed. But "mostly" isn't good enough when you have 122 tickets ahead of you.
The audit found that 10 of 84 stories (12%) are unexecutable in their current state. If we'd started Sprint 4 without checking, we'd have hit blocking errors on the first retro-promoted ticket and lost time debugging methodology violations instead of building features.
The metadata rot is the more insidious problem. When your traceability artifacts are stale, you can't answer "what did we actually deliver?" with confidence. The whole point of the ORCHESTRATE methodology is that every decision, every artifact, every test is traceable. Stale metadata breaks that chain.
We're now executing the remediation plan before proceeding to Sprint 4. The goal is 100% executable stories, 0 failing tests, and accurate metadata across all 101 features, 20 RAID entries, 15 ADRs, and 27 specs.
The AI agents built excellent code. They just didn't clean up after themselves.
This is part of an ongoing series documenting the ORCHESTRATE V3 platform build. Previous posts cover the Audio Engine architecture. The project uses the ORCHESTRATE Agile methodology with mechanical enforcement of lifecycle modes, DD TDD, and quality gates.
Sprint 3 Audio: 154 tests, 2,916 insertions, 8 commits. Sprint 4 MOE: next up, after remediation.
Provenance: ORCHESTRATE V3, Program Run 2428c424. Audit conducted 2026-03-28 by Content Curator persona with 6 parallel analysis agents. All findings verified against MCP tool responses and git history.