
ORCHESTRATE


We Built a 25-Person AI Marketing Agency. Here Is What Actually Works.

Last Month We Had 5,575 Tests and Zero Proof

I wrote about this when it happened. Ten sprints of disciplined TDD — red, green, refactor, validate — and every single test validated our test infrastructure, not our product. Pure functions that computed scores. File scanners that grep'd for patterns. Regex matches against .tsx files read as strings.

The LinkedIn scheduler had been publishing to four brand pages since Sprint 2. Three hundred twenty-two posts delivered. The rest of the platform? Unknown. Probably broken. Definitely unproven.

Sprint 11 was supposed to add Reddit adapters and newsletter integrations. Instead, we scrapped the plan and spent it answering one question: does any of this actually work?

The Proof Sprint

The rule was simple: every feature gets validated against the running server with real HTTP calls. No file scanning. No pure functions. No regex matching source code. If the test doesn't fetch() a URL and inspect what comes back, it doesn't count.
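The rule can be sketched in a few lines. This is a minimal illustration of the principle, not the platform's actual test harness; the fetch interface is injected so the check can run against a stub as well as a live server.

```typescript
// A "proof" only counts if it hits a real URL and inspects the response.
// FetchLike mirrors the shape of the global fetch we rely on.
type FetchLike = (url: string) => Promise<{ status: number; json(): Promise<unknown> }>;

async function proveEndpoint(
  fetchImpl: FetchLike,
  url: string,
  expectStatus = 200,
): Promise<boolean> {
  const res = await fetchImpl(url);
  if (res.status !== expectStatus) return false;
  const body = await res.json();
  // The evidence is live response data from the running server,
  // never a regex match against source files on disk.
  return body !== null && typeof body === "object";
}
```

Every Sprint 11 test reduces to some variant of this: call the product, look at what comes back.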

Here's what we found.

What Works (With Receipts)

7 publishing channels, all proven live:

LinkedIn has been the backbone. 322+ posts published across four brand pages (ORCHESTRATE Method, Run On Rhythm, LEVEL UP, I am HITL), with a scheduler that ticks every 60 seconds and has never missed a post it was supposed to deliver.
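The scheduler's core loop is simple enough to sketch. This is an assumption about its shape, not the real implementation: the tick is a plain function so setInterval can drive it in production while tests call it directly with a fixed clock.

```typescript
// Hypothetical queue entry shape for illustration.
interface QueuedPost {
  id: string;
  publishAt: number;  // epoch millis
  published: boolean;
}

// One tick: publish every due, unpublished post; return how many went out.
function tick(queue: QueuedPost[], now: number, publish: (p: QueuedPost) => void): number {
  let sent = 0;
  for (const post of queue) {
    if (!post.published && post.publishAt <= now) {
      publish(post);
      post.published = true;
      sent++;
    }
  }
  return sent;
}

// In production, something like:
// setInterval(() => tick(queue, Date.now(), publishToLinkedIn), 60_000);
```

Marking the post published inside the same tick is what makes "never missed, never doubled" a property you can test rather than a hope.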

Reddit works. We posted to r/test during validation — actual post, actual upvotes, actual URL you can visit. The OAuth flow authenticates, the submit endpoint creates posts, the status endpoint confirms delivery.

Dev.to works. Article ID 3432001 exists on dev.to right now because Sprint 11 created it via the API.

YouTube works. We uploaded a real video during validation — a 2-minute piece about Sprint 11 built from the platform's own content pipeline. It's live at youtube.com/watch?v=XmOsrtWdRXg. Google Cloud OAuth, YouTube Data API v3, the whole chain.

Newsletter works. Subscribers get added to SQLite, campaigns get created, stats get tracked. Dual-mode: local SQLite for development, Mailchimp API when configured.
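The dual-mode design is worth a sketch: one interface, a local store for development, and the remote provider only when credentials are configured. Names here are illustrative, not the platform's real code, and the Mailchimp branch is deliberately stubbed out.

```typescript
// One interface for both backends.
interface SubscriberStore {
  add(email: string): void;
  count(): number;
}

// Development mode: everything stays local (the real platform uses SQLite;
// an in-memory Set stands in for it here).
class LocalStore implements SubscriberStore {
  private emails = new Set<string>();
  add(email: string) { this.emails.add(email.toLowerCase()); }
  count() { return this.emails.size; }
}

// Pick the backend at startup based on configuration.
function makeStore(mailchimpKey?: string): SubscriberStore {
  if (mailchimpKey) {
    // A real implementation would wrap the Mailchimp API here.
    throw new Error("Mailchimp mode not shown in this sketch");
  }
  return new LocalStore();
}
```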

Podcast works. The full pipeline: script generation from RSS sources, multi-segment TTS narration via Piper (a local neural TTS engine that runs on CPU — no cloud dependency), FFmpeg assembly into a complete episode, and an RSS feed with iTunes and Podcast Index namespaces. The feed validates. The episode plays. The entire chain runs inside Docker.

Printify works. 20 products in the catalog, all visible through the platform's API, all purchasable at iamhitl.com. This one turned out to be broken going into Sprint 11: a route-extraction refactor had moved the Printify endpoints to a new file but left the API credentials behind. A classic migration bug, 5 minutes to fix once we found it, but it had been silently broken for 3 sprints.

What We Built That Nobody Talks About

The publishing channels are the visible part. The invisible infrastructure is where the real work happened.

Content sourcing pipeline. Three RSS feeds from different domains (Hacker News, Ars Technica, NYT Tech), with automatic polling, a dedup engine that prevents re-ingestion of already-seen articles, and a web crawler that fetches actual HTML from URLs. The citation verifier traces claims in generated content back to their sources.
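One way to sketch the dedup engine: normalize each article URL and keep a seen-set so re-polled feed items get skipped. The normalization rules below are assumptions for illustration, not the platform's actual logic.

```typescript
class DedupEngine {
  private seen = new Set<string>();

  // Canonicalize the URL so trivially different links map to one key.
  private normalize(url: string): string {
    const u = new URL(url);
    u.hash = "";                          // drop fragments
    u.searchParams.delete("utm_source");  // drop common tracking noise
    return u.toString();
  }

  // Returns true the first time a URL is seen, false on re-ingestion.
  ingest(url: string): boolean {
    const key = this.normalize(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```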

Quality gate. Trust scoring evaluates source domains. The audit ledger maintains cryptographic chain integrity. Content pieces get scored across four dimensions (factual accuracy, originality, engagement potential, citation density). Low-scoring content routes to a review queue instead of auto-publishing.
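The routing decision reduces to a small function. The equal weighting and 0.7 threshold here are assumptions for illustration; only the four dimensions come from the actual system.

```typescript
// The four scoring dimensions, each normalized to 0..1.
interface QualityScores {
  factualAccuracy: number;
  originality: number;
  engagementPotential: number;
  citationDensity: number;
}

// Average the dimensions; below the threshold, route to human review.
function route(scores: QualityScores, threshold = 0.7): "publish" | "review" {
  const values = [
    scores.factualAccuracy,
    scores.originality,
    scores.engagementPotential,
    scores.citationDensity,
  ];
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return mean >= threshold ? "publish" : "review";
}
```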

Human-in-the-loop review. Content below the quality threshold lands in a pending queue. A human approves or rejects with feedback. The decision gets stored in episodic memory with a timestamp. Batch operations let you clear 20 items in under 25 milliseconds. The morning review workflow — the one that's supposed to let a single operator manage all brands in 30 minutes — completes a full create-filter-approve cycle in 21ms.

Knowledge graph. Nodes for topics, sources, brands, products. Edges connecting them with typed relationships. Temporal grounding that verifies date claims in generated content aren't from the future. All of this accessible via REST endpoints we wired during Sprint 11.
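Temporal grounding can be sketched as a pure check: pull date claims out of generated text and flag any that sit in the future relative to a reference time. The ISO-date regex and the rule itself are assumptions about how such a check could work, not the platform's implementation.

```typescript
// Return every ISO-style date claim (YYYY-MM-DD) that lies after `now`.
function futureDateClaims(text: string, now: Date = new Date()): string[] {
  const isoDates = text.match(/\b\d{4}-\d{2}-\d{2}\b/g) ?? [];
  return isoDates.filter(
    d => new Date(d + "T00:00:00Z").getTime() > now.getTime(),
  );
}
```

A non-empty result would demote the content to the review queue rather than letting a hallucinated future date auto-publish.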

TTS sidecar. A Docker container running Piper neural TTS with 5 voice models. Completely local — no cloud API calls, no usage fees, no rate limits. It synthesizes a full LinkedIn post into 26 seconds of natural-sounding speech in about 2 seconds of inference time. The health endpoint was reporting "unhealthy" for 3 sprints because the model registry wasn't updating status after synthesis. Nobody noticed because nobody had checked the health endpoint against the running container until Sprint 11.
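The shape of that bug is common enough to sketch: the synthesis path worked, but it never wrote its status back to the registry the health endpoint reads. Names here are illustrative, not the sidecar's real code.

```typescript
// What the health endpoint reads from.
class ModelRegistry {
  private status = new Map<string, "unknown" | "healthy">();
  markHealthy(model: string) { this.status.set(model, "healthy"); }
  health(model: string) { return this.status.get(model) ?? "unknown"; }
}

// Synthesis must report back to the registry; omitting the markHealthy
// call is exactly the bug that went unnoticed for 3 sprints.
function synthesize(registry: ModelRegistry, model: string, text: string): string {
  const audio = `wav:${model}:${text.length}`; // stand-in for Piper inference
  registry.markHealthy(model);                 // the line that was missing
  return audio;
}
```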

The Pattern That Almost Killed Us

Here's the lesson that applies to anyone building with AI agents.

We had a TDD enforcement system that mechanically required comments at every phase transition. RED needs test names and assertions. GREEN needs "pass" and "implement." VALIDATE needs a DONE checklist. The MCP server blocks phase advances without these keywords.

This is good process. It caught sloppy work and enforced documentation discipline.

But it created a perverse incentive: write tests that satisfy the keyword requirements without actually testing the system. A test named test_loginForm() that reads a .tsx file and checks for export function LoginForm satisfies RED. Making it pass satisfies GREEN. Writing "all tests pass, DONE criteria met" satisfies VALIDATE.

The test passes. The ceremony completes. The ticket moves to DONE. And nobody ever rendered the login form.
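The contrast is easy to show side by side. Both "tests" below satisfy the keyword gate; only the second proves anything. The names (LoginForm, /login) are illustrative, and the inputs are injected so the sketch runs anywhere.

```typescript
// Ceremony-compliant, product-blind: "passes" whenever the export string
// exists in the source text, even if the component crashes on render.
function fakeLoginTest(sourceText: string): boolean {
  return /export function LoginForm/.test(sourceText);
}

// Product validation: hit the running server and inspect what comes back.
type Fetcher = (url: string) => Promise<{ status: number; text(): Promise<string> }>;

async function realLoginTest(fetchImpl: Fetcher, baseUrl: string): Promise<boolean> {
  const res = await fetchImpl(`${baseUrl}/login`);
  return res.status === 200 && (await res.text()).includes("<form");
}
```

Note that the fake test passes even when the source it scans throws on render. That is the whole failure mode in two functions.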

Process compliance is not product validation. Metrics that measure ceremony completion tell you the team is following the ritual. They don't tell you the product works. The only thing that tells you the product works is calling the product and seeing what happens.

What's Next

Sprint 11 closes with 113 E2E tests across 24 test files, every one making real HTTP calls to the running platform. The gap assessment identified 15 capabilities fully verified and 5 remaining for the next phase:

  • YouTube transcript extraction (API quota management)
  • FTS5 search at 500K entries (load test)
  • WCAG accessibility (zero ARIA attributes currently)
  • OpenAPI specification (219+ endpoints, no formal docs)
  • CI/CD pipeline (still deploying manually via docker compose)

UAT is next. Real content going out on real channels with real stakeholders watching. Not test posts to r/test — actual content on the ORCHESTRATE Method LinkedIn page, actual articles on Dev.to, actual podcast episodes in the RSS feed.

The scheduler is running right now, ticking every 60 seconds. There are 554 posts in the queue across four brands. The next one goes out tonight.

If something breaks, we'll know — because this time, we actually tested it.


This is part of the ORCHESTRATE Build Log, an ongoing series documenting the construction of an AI-powered content marketing platform. The platform uses the ORCHESTRATE framework (the book: amazon.com/dp/B0G2BJKDM6) to structure every AI interaction. Previous entries: "5,575 Tests and Zero Proof" and "The Agent That Doesn't Write Code."
