I have spent the last 10 sprints building an AI-powered marketing platform. 461 tickets. 144 TypeScript services. 4,418 passing tests. Four blog posts in the retrospective series documenting every failure along the way.
Here is what I actually learned. Not the technical details — those are in the series. The things that changed how I think about building AI systems.
The tool description is the real API
We built a memory system. Agents stored 64 memories. When we asked them to recall anything, they got zero results. Nine AI personas independently confirmed: the memory system is broken.
It was not broken. The search used AND-matching on keywords. Agents wrote queries like humans write sentences: seven words long. The search needed two. The data was there the entire time.
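To see why the data was invisible, here is a minimal sketch of AND-matching keyword search. The function name and sample memories are hypothetical, not the project's actual code:

```typescript
// Hypothetical AND-matching search: every query term must appear in the memory.
function searchMemories(memories: string[], query: string): string[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return memories.filter((m) =>
    terms.every((t) => m.toLowerCase().includes(t))
  );
}

const memories = [
  "sprint 9 scope cut to 42 points",
  "SSL certificate renewal scheduled",
];

// A sentence-style query: zero results, because ALL eight words must match.
searchMemories(memories, "what did we decide about the sprint scope"); // → []

// A short keyword query finds the memory.
searchMemories(memories, "sprint scope"); // → ["sprint 9 scope cut to 42 points"]
```

One extra word in the query that does not appear in the stored text is enough to zero out the result set, which is exactly how sentence-length queries failed against the stored memories.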
The fix was two lines of documentation in the tool description explaining how to format queries.
Every hour we spent investigating the architecture, redesigning the storage layer, and filing risk assessments was wasted. The actual problem was a tool description that said "query: string" instead of "query: 1-3 keywords, AND logic, not sentences."
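The fix, roughly, looked like this. The tool schema shape and names below are assumptions for illustration, not the project's real definition:

```typescript
// Hypothetical tool definition. The only change that mattered was the
// description string on the query parameter.
const memorySearchTool = {
  name: "memory_search",
  parameters: {
    query: {
      type: "string",
      // Before: description was just "query: string".
      // After: the description teaches the agent the format the search expects.
      description:
        "1-3 keywords, AND logic, not sentences. " +
        "Example: 'sprint velocity', not 'what was our sprint velocity last month'.",
    },
  },
};
```

No storage code changed. The agent reads the description at call time, so improving it changes behavior immediately.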
If you build tools for AI agents, the description is not documentation. It is the contract. The agent will do exactly what the description teaches it to do, and nothing else.
Velocity you do not measure does not exist
We packed 120 story points into each sprint. Our actual throughput was 38. We did this for eight consecutive sprints.
When we finally checked the velocity tracking tool, it returned zero. Nobody had ever recorded the data. We were planning by ambition and measuring nothing.
The fix was mechanical: look at what you actually completed in the last five sprints. Divide by five. That is your velocity. Plan to that number. Not the number you wish it was.
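The calculation is small enough to write down. A minimal sketch, with illustrative numbers rather than the project's real data:

```typescript
// Plannable velocity: average completed points over the last N sprints,
// rounded down so the plan errs toward less, not more.
function plannableVelocity(completedPoints: number[], window = 5): number {
  const recent = completedPoints.slice(-window);
  const total = recent.reduce((sum, p) => sum + p, 0);
  return Math.floor(total / recent.length);
}

plannableVelocity([35, 41, 38, 36, 40]); // → 38
```

The only discipline required is feeding it completed points, not committed ones.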
After we cut Sprint 9 from 155 points to 42, the seven stories that survived were all infrastructure: SSL, test coverage, accessibility, API documentation, end-to-end pipeline verification, sidecar deployment, retrospective. None of them were features. All of them were things we had skipped while shipping features.
The overcommitment pattern was hiding infrastructure debt behind feature velocity.
AI consensus is not correctness
During our team review, we activated 14 AI personas. Each one audited the project from their expertise. Nine of fourteen reported the same critical finding: the memory system is dead.
They were all wrong. They were all wrong in the same way, for the same reason: they all wrote natural-language queries against a keyword search engine.
When multiple AI agents agree, it feels like validation. It is not. It is a shared mode of failure. The more agents that agree, the more likely they inherited the same assumption from the same language model.
Check the assumption before you trust the consensus.
The gap between building and selling is not a feature list
We built 144 services covering content sourcing, audio synthesis, video generation, podcast assembly, YouTube publishing, quality gates, provenance tracking, and brand voice enforcement. The technical achievement is real.
Then we ran a stakeholder audit and discovered: no social media publishing, no email marketing, no analytics dashboard, no client management, no billing system. We built a content production engine and called it a marketing agency.
The seven critical business capabilities we were missing could not be solved by adding more services. They required architectural decisions about multi-tenancy, platform adapters, and revenue models that were never made during inception.
Building software is not the same as building a product. A product has customers, revenue, and a way to prove it works. We had tests.
What actually ships a production system
After all the audits, reviews, and scope cuts, here is what we determined must exist before the platform can run in production:
- SSL/TLS. HTTP-only is not acceptable regardless of other priorities.
- Test coverage for every service that handles user data.
- Accessibility foundations before building UI components, not after.
- API documentation before building API consumers.
- End-to-end verification with real services, not mocks.
- Monitoring that tells the operator when something breaks, not silence.
- Backup procedures that have been tested, not just written.
- UAT scenarios that walk through what the user actually does daily.
None of these are exciting. All of them are the difference between a demo and a deployment.
The number that matters
We have 81 stories done and 34 remaining. The 34 remaining stories have 162 tickets, every one with specific DONE criteria. Eight UAT scenarios with generated Playwright scripts. 51 architecture decision records. 48 risk and dependency entries tracked.
The plan exists. The infrastructure debt is named. The velocity is honest. The scope is realistic.
Whether it ships depends on doing the boring work first.
10 sprints. 461 tickets. The most important lesson was not technical. It was that the distance between "the code works" and "the product works" is where most projects fail. We are in that distance now.