The architecture nobody is marketing
I just wrote in The AI Journal about why our autonomous AI agent ran for six months, cost $310 in API charges, and produced zero new dofollow backlinks. The reasons generalize: Gartner predicts more than 40% of agentic AI projects will be canceled by end of 2027; McKinsey finds 73% of enterprise AI projects fail to deliver ROI; Writer reports 88% of AI agent pilots never reach production.
Berkeley AI Research named the alternative in February 2024: the Compound AI System. Either control flow is written in traditional code that calls LLMs at specific bounded steps, or control flow is driven by an LLM that decides what to do next. Compound systems pick the first. Agents pick the second.
This post is the implementation detail of the working alternative.
The six models inside BingWow
BingWow is a free AI bingo card platform used by classrooms and HR teams. Six models from four vendors handle different parts of the pipeline:
- Claude Sonnet 4.5 — content quality judgment, generation
- Claude Haiku 4.5 — classification: moderation, dedup, categorization
- Gemini 2.5 Flash — bingo-clue generation
- Gemini 3 Flash Preview + Gemini 2.5 Pro fallback — themed display names per card topic
- GPT-4o Vision — background image description (accessibility, search)
- Replicate Flux Schnell — background image generation
None of these models decides what happens next. Code decides.
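That routing can be sketched as a plain lookup table. This is an illustrative sketch, not BingWow's actual config — the task names, type shapes, and `modelFor` helper are my invention; only the model IDs come from the list above:

```typescript
// Hypothetical sketch: the model routing table lives in code, not in a model.
// Task names and the table shape are illustrative, not BingWow's actual config.
type Task =
  | "quality_judgment"
  | "classification"
  | "clue_generation"
  | "display_names"
  | "image_description"
  | "image_generation";

const MODEL_FOR_TASK: Record<Task, { vendor: string; model: string; fallback?: string }> = {
  quality_judgment:  { vendor: "anthropic", model: "claude-sonnet-4-5" },
  classification:    { vendor: "anthropic", model: "claude-haiku-4-5" },
  clue_generation:   { vendor: "google",    model: "gemini-2.5-flash" },
  display_names:     { vendor: "google",    model: "gemini-3-flash-preview", fallback: "gemini-2.5-pro" },
  image_description: { vendor: "openai",    model: "gpt-4o" },
  image_generation:  { vendor: "replicate", model: "flux-schnell" },
};

// Swapping a model is a one-line diff here, visible in code review.
function modelFor(task: Task): string {
  return MODEL_FOR_TASK[task].model;
}
```

The point of the table is that a routing change is a reviewable diff, not a prompt tweak.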
The pipeline is code
Every transition between models is a TypeScript function, a SQL query, or a cron job. Here is the pipeline from the moment a visitor types a card topic to the moment that card is browsable on bingwow.com/cards:
// app/api/cron/process-pending-topics/route.ts (06:00 UTC daily)
export async function GET(req: NextRequest) {
  requireCron(req);
  const topics = await getPendingTopics({ limit: BATCH_SIZE });
  for (const topic of topics) {
    // Step 1: Gemini 2.5 Flash generates clues (structured-output schema)
    const clues = await generateClues(topic);
    // Step 2: SQL deduplicates against existing cards in same category
    const isDuplicate = await findSemanticDuplicate(topic.category_id, clues);
    if (isDuplicate) { await markRejected(topic.id, 'duplicate'); continue; }
    // Step 3: Claude Haiku 4.5 categorizes (validated against DB read)
    const validCategoryIds = await getSubcategories();
    const suggestedId = await classifyToCategory(topic, clues, validCategoryIds);
    if (!isValidSubcategoryId(suggestedId)) { /* fallback */ }
    // Step 4: Claude Sonnet 4.5 makes the publishability call
    const { publish } = await decidePublishability(topic, clues);
    if (!publish) { await hardDelete(topic.id); continue; }
    // Step 5: Replicate Flux Schnell generates a background (4 attempts max)
    const backgroundId = await resolveNewCardBackgroundId({ topic, clues });
    // Step 6: insert card, flip status to 'published'
    await insertCard({ topic, clues, category_id: suggestedId, backgroundId });
  }
  return Response.json({ processed: topics.length });
}
Six steps. Each one a code decision. The models do bounded work between the decisions.
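"Bounded work" means each model call returns data that code validates before anything downstream runs. Here is a minimal sketch of what that boundary can look like for Step 1 — `parseCluesResponse` and `MIN_CLUES` are illustrative names I made up, not BingWow's actual API:

```typescript
// Hypothetical sketch of a bounded step: the model returns JSON text, and
// code — not the model — decides whether the result is usable.
// parseCluesResponse and MIN_CLUES are illustrative, not BingWow's real names.
const MIN_CLUES = 24; // assumption: a 5x5 card needs 24 clues plus a free center

function parseCluesResponse(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("model returned non-JSON output");
  }
  if (
    !Array.isArray(parsed) ||
    parsed.length < MIN_CLUES ||
    !parsed.every((c) => typeof c === "string" && c.trim().length > 0)
  ) {
    throw new Error("model output failed shape validation");
  }
  // Trim and dedupe in code; the model is never asked to self-verify.
  return [...new Set(parsed.map((c) => c.trim()))];
}
```

A malformed response throws here, at a named line in the codebase, instead of propagating as a silently broken card.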
What this buys
Auditable cost. Every API call has a named caller in the codebase. When the April 2026 Anthropic bill spiked, I found the offender by grepping for claude-3-5-opus in lib/*.ts and replacing it with claude-haiku-4-5 in three files. The bill dropped from $560 a month to between $170 and $245. The system generates 30,000 AI bingo cards a month at that cost.
$ grep -l "claude-3-5-opus" lib/*.ts
lib/moderation-prompt.ts
lib/categorize.ts
lib/dedupe.ts
$ sed -i '' 's/claude-3-5-opus/claude-haiku-4-5/g' lib/moderation-prompt.ts lib/categorize.ts lib/dedupe.ts
An agent would keep burning the same $560, because in an agent routing is not a code decision — the agent owns it, and there is nothing to grep.
Auditable failure. When categorization started landing in the wrong subcategory in March, the bug was in a static fallback list in the moderation prompt — not in the model's judgment. The fix was to read the subcategory list from the categories table at request time and validate the AI's returned ID against the same DB read:
// lib/moderation-prompt.ts
export async function buildModerationPrompt(topic: Topic) {
  const subcategories = await getSubcategories(); // 5-min cached DB read
  const categoryList = subcategories
    .map(c => `${c.id}: ${c.name}`)
    .join('\n');
  return `[...prompt header...]
Pick a category ID from this list:
${categoryList}
Respond with { "suggested_category_id": "<id>" }`;
}
// lib/categories.ts
export function isValidSubcategoryId(id: string): boolean {
  const subcategories = getSubcategoriesSync();
  return subcategories.some(c => c.id === id && !c.is_parent);
}
The fix is in TypeScript, not in prompt engineering, because the failure was a code failure. There is no prompt edit that can recover from a stale fallback list — the data structure has to change.
Auditable evaluation. Tests exist for code. Every API route has a test fixture that calls it with a known input and asserts on the output shape. Continuous evaluation runs on every deploy. Drift on any axis triggers a code change, not a vibes-based prompt tweak.
The bingo caller is a worked example
The BingWow caller is the most-trafficked surface in the product. It supports 30-ball, 75-ball, and 90-ball bingo with voice calls, a flashboard, auto-draw, manual draw, and printable number cards. Every layer of it is a worked example of the compound pattern:
- The voice that calls each ball is one of 331 pre-recorded MP3s. The choice to ship pre-recorded audio instead of synthesizing speech at call time is a code decision — Web Speech API drifts in pacing and pronunciation; recorded audio is identical every run.
- The flashboard renders 75 cells in a deterministic layout. The bingo-detection logic is TypeScript; no LLM is asked whether a row is complete.
- The card-validation flow (proving that a 5-character card code corresponds to a winning board) is a single SQL query plus a deterministic reconstructCard function. No model is asked to validate; the math is the contract.
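The post does not show the real reconstructCard, so here is one way such a deterministic reconstruction could work, as a sketch: hash the card code into a seed, drive a seeded PRNG, and shuffle each 75-ball column band reproducibly. Every detail below is my assumption, not BingWow's implementation:

```typescript
// Hypothetical sketch: one way a deterministic reconstructCard could work.
// The real function isn't shown in the post; the point is only that the same
// card code always yields the same board, so validation is pure math.
function hashCode(code: string): number {
  // FNV-1a-style hash of the 5-character card code into a 32-bit seed.
  let h = 0x811c9dc5;
  for (const ch of code) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

function reconstructCard(code: string): number[] {
  // Seeded PRNG (mulberry32) so the shuffle is reproducible.
  let s = hashCode(code);
  const rand = () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = Math.imul(s ^ (s >>> 15), 1 | s);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
  // 75-ball layout: column n draws 5 numbers from its 15-number band.
  const board: number[] = [];
  for (let col = 0; col < 5; col++) {
    const band = Array.from({ length: 15 }, (_, i) => col * 15 + i + 1);
    for (let i = band.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1)); // Fisher–Yates shuffle
      [band[i], band[j]] = [band[j], band[i]];
    }
    board.push(...band.slice(0, 5));
  }
  return board;
}
```

With a scheme like this, "does code X correspond to winning board Y" is a pure function plus one SQL lookup — nothing for a model to hallucinate.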
If any of those layers were handed to an LLM with the framing "you are an autonomous bingo agent," the product would be slower, more expensive, and less reliable on every dimension.
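For concreteness, the flashboard's deterministic win check can be sketched like this. The function name and the row-major `marked` array are illustrative assumptions; the free-center convention is standard 75-ball bingo:

```typescript
// Illustrative sketch of deterministic win detection on a 5x5 75-ball card.
// `marked` is a 25-element boolean array in row-major order; position (2,2)
// is the free center. No model is consulted; a line is complete or it isn't.
function hasBingo(marked: boolean[]): boolean {
  const at = (row: number, col: number) =>
    row === 2 && col === 2 ? true : marked[row * 5 + col];

  for (let i = 0; i < 5; i++) {
    let row = true;
    let col = true;
    for (let j = 0; j < 5; j++) {
      row &&= at(i, j);
      col &&= at(j, i);
    }
    if (row || col) return true;
  }
  let diag1 = true;
  let diag2 = true;
  for (let i = 0; i < 5; i++) {
    diag1 &&= at(i, i);
    diag2 &&= at(i, 4 - i);
  }
  return diag1 || diag2;
}
```

Twenty-five boolean reads and some loops: microseconds, zero API cost, identical answer every run.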
What this does not buy
A compound system does not replace the engineer who writes the orchestration code. The shape of that engineer's day changes: instead of prompt-tuning a single multi-step plan, they are writing TypeScript that calls bounded models and writing tests that pin the boundary. That work is not glamorous; it does not show up in any vendor's pitch deck. There is no margin in selling code that calls Python functions.
If your team has the resources to staff one senior engineer plus a continuous evaluation discipline, you can ship a compound system today. The model choices in this post are deliberate and replaceable — a year from now the right routing might be different — but the architecture is durable.
Receipts
- BingWow — the product the stack runs
- BingWow Research Portal — open-licensed engagement research generated by the same stack
- State of Team Building Games 2026 — recent research output
- The AI Journal — I Built an AI Agent for $310. It Failed for the Same Reason Yours Will. — the companion editorial on why agents fail
- Berkeley AI Research — The Shift from Models to Compound AI Systems — the paper that named the pattern (Zaharia, Khattab, Chen et al., February 2024)
Pick the architecture. Don't pick the marketing label. The compound AI system is the architecture nobody is marketing — and that is the point.