ORCHESTRATE

What 10 Sprints of AI-Driven Development Actually Taught Us

What Broke First

The earliest failure was assuming AI agents could maintain context across sprint boundaries without explicit memory infrastructure. Sprint 2 exposed this when repeated decisions were re-debated because no one stored them. We lost roughly 30% of Sprint 2 velocity to rework caused by forgotten architectural decisions.

The second painful lesson: TDD phase discipline felt like overhead until Sprint 4, when a missed VERIFY phase let a false-positive test through. The bug survived three sprints before a regression test caught it. After that, the RED-VERIFY-GREEN-REFACTOR-VALIDATE sequence stopped being a suggestion and became muscle memory.
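Catching a false positive is mechanical once the VERIFY step is explicit: run the freshly written RED test against a stub and refuse to proceed if it passes. Below is a minimal sketch of that guard with a hypothetical `parseCron` example; none of these names are from the project:

```typescript
// Tiny throw-based test runner, enough to demonstrate the VERIFY guard.
type TestResult = { name: string; passed: boolean };

function runTest(name: string, body: () => void): TestResult {
  try {
    body();
    return { name, passed: true };
  } catch {
    return { name, passed: false };
  }
}

function assertEqual<T>(actual: T, expected: T): void {
  if (actual !== expected) {
    throw new Error(`expected ${expected}, got ${actual}`);
  }
}

// Stub implementation: nothing is written yet, so it returns a placeholder.
const parseCron = (_expr: string): number => -1;

// VERIFY: the new RED test must fail against the stub. If it passes here,
// the assertion is vacuous and would let a false positive through.
const red = runTest('parses */5 as a 5-minute interval', () =>
  assertEqual(parseCron('*/5 * * * *'), 5),
);

if (red.passed) {
  throw new Error('VERIFY failed: test passes before any implementation exists');
}
```

The guard costs a few seconds per test; the false positive it prevents survived three sprints.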

The 10-Sprint Arc

Starting from zero, the program grew across 10 two-week sprints:

  • Sprint 1-2: Foundation. MCP server, basic CRUD, scheduler core. Velocity: ~15 points/sprint. Test count: 24 to 52.
  • Sprint 3-4: Content pipeline. Audio narration, multi-page support, queue management. Velocity stabilized at ~18. Tests: 78 to 112.
  • Sprint 5-6: Intelligence layer. Analytics, feedback loops, voice style engine. Velocity: ~23. Tests: 156 to 198.
  • Sprint 7-8: Scale and robustness. Load testing, NFR validation, CI pipeline. Velocity: ~25. Tests: 245 to 301.
  • Sprint 9-10: Production hardening. Auth, health checks, Docker optimization, UAT. Velocity: ~26. Tests: 362 to 415.

Cycle time dropped from an average of 4.2 hours in Sprint 1 to 2.0 hours by Sprint 10. The biggest improvement came in Sprint 4-5, when we stopped treating documentation as an afterthought and started binding docs to tickets before writing tests.

What We Got Wrong

Memory architecture was an afterthought. We designed the MCP tool surface first and bolted on memory later. This meant Sprint 6-7 required a painful refactor to unify persona memory with program-level search. The lesson: design your memory model before your tool surface.
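The dependency direction matters more than the storage choice. A minimal sketch of "memory model first", where the tool layer depends on a memory interface rather than the other way round; every name here is illustrative, not the project's actual API:

```typescript
// Hypothetical memory interface defined before any tool exists.
interface MemoryStore {
  remember(scope: 'persona' | 'program', key: string, value: string): void;
  search(scope: 'persona' | 'program', query: string): string[];
}

// One concrete store; swapping it later is a storage change, not a refactor.
class InMemoryStore implements MemoryStore {
  private entries: { scope: string; key: string; value: string }[] = [];

  remember(scope: 'persona' | 'program', key: string, value: string): void {
    this.entries.push({ scope, key, value });
  }

  search(scope: 'persona' | 'program', query: string): string[] {
    return this.entries
      .filter((e) => e.scope === scope && e.value.includes(query))
      .map((e) => e.value);
  }
}

// Tools take the store as a dependency, so unifying persona memory with
// program-level search never touches the tool surface.
function makeDecisionTool(store: MemoryStore) {
  return (decision: string) => store.remember('program', 'decision', decision);
}

const store = new InMemoryStore();
makeDecisionTool(store)('use JWT for service-to-service auth');
```

Had the tool surface depended on an interface like this from Sprint 1, the Sprint 6-7 unification would have been an implementation swap.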

Security testing started too late. Guard Ian's security review in Sprint 8 found gaps that should have been caught in Sprint 3. Input validation, rate limiting, and auth were all retrofitted rather than designed in. Cost: approximately two full stories of rework.

Documentation staleness was invisible until Sprint 5, when living_docs_manage introduced staleness detection. Before that, specs drifted from implementation silently. Multiple stories were built against outdated specs.
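Staleness detection itself is simple once docs record what they are bound to. A hedged sketch of the idea; the `LivingDoc` shape and field names are assumptions, and the real `living_docs_manage` internals are not shown in this post:

```typescript
// Hypothetical shape of a living document bound to implementation files.
interface LivingDoc {
  id: string;
  updatedAt: number;    // epoch millis of the doc's last edit
  boundFiles: string[]; // implementation files this doc describes
}

// A doc is stale when any bound file changed after the doc's last update.
function isStale(doc: LivingDoc, fileMtimes: Record<string, number>): boolean {
  return doc.boundFiles.some((f) => (fileMtimes[f] ?? 0) > doc.updatedAt);
}
```

The check is one comparison per bound file; the hard part is having the binding at all, which is why the drift was invisible before Sprint 5.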

Reusable Patterns

Pattern 1: Result Monad for Service Boundaries

Every service function returns a typed Result instead of throwing:

type Result<T> = { ok: true; value: T } | { ok: false; error: string };

function parseSchedule(input: string): Result<Schedule> {
  if (!input || input.trim() === '') {
    return { ok: false, error: 'Schedule input cannot be empty' };
  }
  // ... parsing logic
  return { ok: true, value: schedule };
}

This eliminated try/catch sprawl and made error paths testable. Every service adopted this by Sprint 6.
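A self-contained version of the call site shows why the error path becomes testable: failure is an ordinary return value, not an exception to intercept. The `Schedule` shape below is an assumption for illustration:

```typescript
type Result<T> = { ok: true; value: T } | { ok: false; error: string };

// Hypothetical schedule shape; the real one is not shown in the post.
interface Schedule {
  cron: string;
}

function parseSchedule(input: string): Result<Schedule> {
  if (!input || input.trim() === '') {
    return { ok: false, error: 'Schedule input cannot be empty' };
  }
  return { ok: true, value: { cron: input.trim() } };
}

// Callers branch on the discriminant instead of wrapping in try/catch,
// so both paths are plain values a test can assert on.
const bad = parseSchedule('   ');
const good = parseSchedule('0 9 * * 1');
```

TypeScript narrows the union on `result.ok`, so the success branch gets a typed `value` and the failure branch a typed `error` with no casts.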

Pattern 2: Pure-Function Test Architecture

Tests that need complex setup but no runtime dependencies use co-located pure functions with synthetic data:

// Define the function inline, test with synthetic data
type SprintData = { count: number };
type Metrics = { total: number };

function calculateMetrics(sprints: SprintData[]): Metrics {
  return { total: sprints.reduce((sum, sprint) => sum + sprint.count, 0) };
}

describe('calculateMetrics', () => {
  it('should aggregate across sprints', () => {
    const data = [{ count: 10 }, { count: 20 }];
    expect(calculateMetrics(data).total).toBe(30);
  });
});

This pattern produced 415 tests that run in under 2 seconds total.

Pattern 3: Documentation-Driven TDD

The DD-TDD workflow — update docs, bind to ticket, write failing test, implement, refactor, validate — reduced rework by an estimated 40% compared to code-first sprints. The key insight: writing the spec before the test forces you to think about the interface before the implementation.

1. Update spec/guide with intended behavior
2. living_docs_manage(action='bind', document_id, ticket_id)
3. Write failing test (TDD_RED)
4. Implement minimum to pass (TDD_GREEN)
5. Refactor (TDD_REFACTOR)
6. Validate all tests + update docs (TDD_VALIDATE)
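The phase order can even be enforced mechanically. A sketch that encodes the steps as a transition table and rejects out-of-order moves; the table is an assumption, though the phase names follow the workflow above:

```typescript
// Phases of the DD-TDD workflow, in order.
type Phase = 'DOCS' | 'BIND' | 'TDD_RED' | 'TDD_GREEN' | 'TDD_REFACTOR' | 'TDD_VALIDATE';

// Each phase has exactly one legal successor; VALIDATE ends the cycle.
const next: Record<Phase, Phase | null> = {
  DOCS: 'BIND',
  BIND: 'TDD_RED',
  TDD_RED: 'TDD_GREEN',
  TDD_GREEN: 'TDD_REFACTOR',
  TDD_REFACTOR: 'TDD_VALIDATE',
  TDD_VALIDATE: null,
};

function advance(current: Phase, requested: Phase): Phase {
  if (next[current] !== requested) {
    throw new Error(`Illegal transition ${current} -> ${requested}`);
  }
  return requested;
}
```

A guard like this is what turned the sequence from a suggestion into something a missed VERIFY could not silently skip.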

What Actually Shipped vs. The Vision

The original V3 vision called for YouTube integration, podcast generation, AI news aggregation, and a 25-staff agency capacity. What actually shipped: a production-hardened LinkedIn campaign scheduler with 102 MCP tools, 4 LinkedIn pages, audio narration, analytics, and a health monitoring dashboard.

YouTube and podcast generation remain in the backlog. The honest assessment: we scoped aggressively and delivered the core platform solidly. The infrastructure for the remaining features exists, but the features themselves do not.

What We Would Change

  1. Start with memory architecture. Design the knowledge graph and memory model in inception, not Sprint 6.
  2. Security from Sprint 1. Every service gets input validation and auth checks before the first feature.
  3. Smaller stories. Our 5-8 point stories in early sprints were too large. 2-3 point stories with 3 tickets each produced more predictable velocity.
  4. Explicit persona handoff protocols. When Api Endor finished a backend ticket and React Ive picked up the frontend, handoff context was often lost. Structured handoff comments would fix this.
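The structured handoff comment in point 4 could be as simple as a typed record rendered into the ticket. A sketch with illustrative field names, not a spec:

```typescript
// Hypothetical handoff record passed between personas on a ticket.
interface Handoff {
  fromPersona: string;
  toPersona: string;
  ticketId: string;
  decisions: string[];     // decisions made during the finished ticket
  openQuestions: string[]; // anything the next persona must resolve
  verifyCommands: string[]; // how to reproduce the current green state
}

// Render the record as a ticket comment with greppable prefixes.
function formatHandoff(h: Handoff): string {
  return [
    `HANDOFF ${h.ticketId}: ${h.fromPersona} -> ${h.toPersona}`,
    ...h.decisions.map((d) => `DECIDED: ${d}`),
    ...h.openQuestions.map((q) => `OPEN: ${q}`),
    ...h.verifyCommands.map((c) => `VERIFY: ${c}`),
  ].join('\n');
}
```

The value is not the format but the checklist it forces: a handoff without decisions, open questions, and a verify command is visibly incomplete.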

The program is complete. The platform works. The lessons are documented. What happens next depends on whether the patterns survive contact with the next team that picks this up.
