brainbootdev
Why I stopped writing prompts and started compiling them


I've been building AI-powered tools for a while and I kept hitting the same wall. Not the models — the models are fine. The problem is how we use them.

Every interaction with an AI starts from zero. You open a new conversation, paste your system prompt, explain your context, get partway through something useful, watch the model drift off course, hit the context limit, and then start over. I tracked my own usage for a month and roughly 40% of my tokens were spent re-establishing context that the model already knew five minutes ago.

That's not a model problem. That's an architecture problem. We're treating prompts like disposable messages when they should be treated like software.

What "prompts as software" actually means

Think about what makes code reliable. Types enforce structure. Tests verify behavior. Modules compose into larger systems. Error handling catches failures before they propagate.

Now think about what prompts have. None of that. A prompt is a string you paste into a text box and hope it works. There's no type checking on what comes back. There's no test suite. There's no way to compose two prompts together with a contract between them. If the output is garbage, you just try again and burn more tokens.

I built Brainboot to close that gap. The core primitive is a "brain" — a prompt wrapped in actual engineering.

The anatomy of a brain

A brain declares three things beyond the prompt itself.

Typed inputs and outputs. The brain specifies what shape of data it accepts and what shape it returns. The runtime validates both directions. If you ask for JSON and the model returns markdown, the runtime catches it and retries automatically. You never burn tokens processing malformed output downstream.
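To make the idea concrete, here's a minimal sketch of what a typed brain wrapper could look like. The names (`Brain`, `runBrain`, the outline types) and the retry logic are my illustration of the pattern, not Brainboot's actual API:

```typescript
// Hypothetical sketch: a brain declares typed I/O, and the runtime
// validates the model's output and retries on shape mismatch.
type OutlineInput = { topic: string };
type OutlineOutput = { title: string; headings: string[] };

interface Brain<I, O> {
  prompt: (input: I) => string;
  // Returns null when the raw output doesn't match the declared shape.
  parse: (raw: string) => O | null;
}

const outlineBrain: Brain<OutlineInput, OutlineOutput> = {
  prompt: ({ topic }) => `Return JSON {"title", "headings"} outlining: ${topic}`,
  parse: (raw) => {
    try {
      const o = JSON.parse(raw);
      return typeof o.title === "string" && Array.isArray(o.headings) ? o : null;
    } catch {
      return null; // e.g. the model returned markdown instead of JSON
    }
  },
};

async function runBrain<I, O>(
  brain: Brain<I, O>,
  input: I,
  callModel: (prompt: string) => Promise<string>,
  maxRetries = 3,
): Promise<O> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    const raw = await callModel(brain.prompt(input));
    const parsed = brain.parse(raw);
    if (parsed !== null) return parsed; // validated: safe to hand downstream
  }
  throw new Error(`output failed validation after ${maxRetries} attempts`);
}
```

The key point is that validation and retry live in the wrapper, so downstream code only ever sees well-shaped data.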

Invariants. These are rules enforced on every execution at the wrapper layer. Not instructions in the system prompt that the model might follow. Actual guardrails that the runtime checks against the output before it's returned. Things like "never include placeholder text" or "all URLs must be valid" or "output must be valid TypeScript." The model cannot drift past these.
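An invariant layer can be as simple as a list of predicates checked against the output before it's returned. This sketch is my guess at the shape of such a layer, using two of the example rules above; the names are illustrative:

```typescript
// Hypothetical invariant layer: each rule is a predicate the runtime
// checks against the output, independent of the system prompt.
type Invariant = { name: string; check: (output: string) => boolean };

const noPlaceholders: Invariant = {
  name: "never include placeholder text",
  check: (out) => !/(\blorem ipsum\b|\bTODO\b|\bTBD\b|\[placeholder\])/i.test(out),
};

const urlsAreValid: Invariant = {
  name: "all URLs must be valid",
  check: (out) => {
    const urls = out.match(/https?:\/\/\S+/g) ?? [];
    return urls.every((u) => {
      try { new URL(u); return true; } catch { return false; }
    });
  },
};

// Returns the names of violated invariants; an empty array means pass.
function enforce(output: string, invariants: Invariant[]): string[] {
  return invariants.filter((inv) => !inv.check(output)).map((inv) => inv.name);
}
```

A non-empty result would trigger a retry (or a hard failure) before anything reaches the caller.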

Test suites. Every brain has tests that run across multiple models — currently GPT-5, Claude Opus, and Gemini Ultra. Pass rates are public. You know whether a brain works reliably before you use it in production. You don't have to re-test it every time.

Composition is where it gets interesting

Small brains chain into pipelines. A research brain feeds an outline brain feeds a drafting brain feeds a quality gate. Each brain in the chain gets exactly the context it needs — not your entire conversation history. The output of brain A is type-checked before it becomes the input of brain B.

This solves the token problem at a fundamental level. A monolithic conversation carrying 50K tokens of history to maintain context gets replaced by a chain of focused brains each using maybe 2-5K tokens of precisely scoped context. Same outcome, fraction of the token cost, and no drift because invariants catch it at every step.

It also makes failures cheap. If brain 4 in a 6-brain pipeline fails, you retry brain 4 with its specific input. You don't re-run brains 1 through 3. In a monolithic conversation, any failure means starting the whole thing over.
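The retry-at-the-failed-stage behavior falls out naturally once each stage is a typed unit. A rough sketch, with stage names and types invented for illustration:

```typescript
// Sketch of per-stage retry: if stage k fails, only stage k re-runs;
// upstream results are kept. Stages and types are illustrative.
type Stage<I, O> = { name: string; run: (input: I) => Promise<O> };

async function runStage<I, O>(stage: Stage<I, O>, input: I, retries = 2): Promise<O> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await stage.run(input);
    } catch (err) {
      lastError = err; // retry this stage only, with its specific input
    }
  }
  throw new Error(`stage "${stage.name}" failed: ${String(lastError)}`);
}

// Two toy stages with typed handoffs: each sees only its scoped input,
// never the whole conversation history.
const researchStage: Stage<{ topic: string }, { facts: string[] }> = {
  name: "research",
  run: async ({ topic }) => ({ facts: [`fact about ${topic}`] }),
};
const outlineStage: Stage<{ facts: string[] }, { headings: string[] }> = {
  name: "outline",
  run: async ({ facts }) => ({ headings: facts.map((f) => `Section: ${f}`) }),
};

async function pipeline(topic: string) {
  const research = await runStage(researchStage, { topic });
  return runStage(outlineStage, research);
}
```

Because the compiler checks that each stage's output type matches the next stage's input type, a mid-pipeline retry can't silently change the contract downstream.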

The compiler

The most interesting feature is the compiler. You describe what you want in plain English — something like "weekly content pipeline for B2B SaaS" — and it runs a four-stage process.

Stage 1, Decompose — breaks your intent into atomic capabilities. "This needs keyword research, topic clustering, outline generation, drafting, SEO optimization, and quality review."

Stage 2, Map — searches the existing brain catalog for brains that match each capability. "Keyword Research Brain covers capability 1. Cluster Architecture Brain covers capability 2."

Stage 3, Synthesize — wires the matched brains into a composition graph with typed connections between them. The output schema of each brain is validated against the input schema of the next.

Stage 4, Audit — runs the complete composition through a quality check and grades it A through F with a deploy recommendation.

The output isn't a prompt. It's a deployable system. Describe once, compile once, run forever.
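To show how the four stages fit together, here's a toy version of the compile step. The real stages are LLM-driven; this sketch hard-codes a tiny catalog and takes capabilities as input, and the grading rule is my own simplification:

```typescript
// Toy four-stage compile: Decompose -> Map -> Synthesize -> Audit.
// Catalog entries and the grading heuristic are illustrative.
type BrainSpec = { name: string; capability: string };

const catalog: BrainSpec[] = [
  { name: "Keyword Research Brain", capability: "keyword research" },
  { name: "Outline Brain", capability: "outline generation" },
  { name: "Draft Brain", capability: "drafting" },
];

function compile(intent: string, capabilities: string[]) {
  // Stage 1, Decompose: here the caller supplies atomic capabilities directly.
  // Stage 2, Map: find a catalog brain for each capability.
  const mapped = capabilities.map((cap) => ({
    capability: cap,
    brain: catalog.find((b) => b.capability === cap) ?? null,
  }));
  // Stage 3, Synthesize: wire matched brains into an ordered graph.
  const graph = mapped.filter((m) => m.brain !== null).map((m) => m.brain!.name);
  // Stage 4, Audit: grade by coverage of the requested capabilities.
  const coverage = graph.length / capabilities.length;
  const grade = coverage === 1 ? "A" : coverage >= 0.5 ? "C" : "F";
  return { intent, graph, grade, deploy: grade === "A" };
}
```

The interesting property is that the audit stage can refuse to recommend deployment when the catalog can't cover every decomposed capability.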

What this looks like in production

I have two circuits running in production right now.

Content Empire runs 6 brains in a pipeline — Keyword Research, Cluster Architecture, Outline, Draft, On-Page SEO, Quality Gate. It produces 150 SEO-optimized pages per quarter, 13 weekly newsletters, 12 authority essays, 3 full SEO audits, and 500+ platform adaptations. It runs on a schedule via Vercel Cron with zero manual intervention. $299/month versus roughly $25K/month for the equivalent human team.

Sales Engine runs 5 blueprints handling prospecting, research, personalized outreach, reply handling, and pipeline intelligence. 500 qualified prospects per month with individualized research and multi-step sequences. $399/month versus roughly $6,600/month for an SDR.

To validate the architecture on something with objectively verifiable results I also pointed a circuit at MLB predictions — a 6-layer probability engine predicting scoreless innings. It's been running autonomously since opening day. 150+ verified picks at 89% accuracy. Completely unrelated to the core business, but it proved that composed brains with invariants work on real-world problems with measurable outcomes. Full public track record at brainboot.dev/labs/nrfi/results.

The stack

Next.js 16 with App Router and React 19. Supabase for auth and Postgres. Vercel for deployment and cron jobs. Stripe for billing. Resend for transactional email. Vercel AI SDK v6 with AI Gateway for multi-model routing.

The brain runtime handles type validation, invariant enforcement, retry logic, and telemetry. Every execution produces a cognitive trace showing which models, rules, and brains contributed to which decisions.
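For a sense of what a cognitive trace might contain, here's a hypothetical record shape and a summarizer. The field names are my invention, not Brainboot's actual telemetry schema:

```typescript
// Hypothetical trace schema: one event per brain execution attempt.
type TraceEvent = {
  brain: string;        // which brain ran
  model: string;        // which model served the attempt
  rulesChecked: string[]; // invariants evaluated on the output
  violations: string[];   // invariants that failed (empty = clean pass)
  attempt: number;        // 1 = first try, >1 = retry
  latencyMs: number;
};

function summarize(trace: TraceEvent[]) {
  return {
    totalAttempts: trace.length,
    retries: trace.filter((e) => e.attempt > 1).length,
    violated: trace.flatMap((e) => e.violations),
  };
}
```

A trace like this is what lets you answer "which model, rule, or brain caused this decision" after the fact instead of guessing.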

Total infrastructure cost is roughly $50/month. Solo built, no funding.

Try it

The platform is live at https://brainboot.dev. Free tier has 200+ curated prompts with no account required. The compiler, brains, and circuits are where it gets interesting.

The manifesto explaining the full "prompts are software" philosophy is at https://brainboot.dev/manifesto.

I'm interested in how other people are approaching the reliability and token efficiency problem. If you've built something similar or taken a different approach, I'd genuinely like to hear about it.

Top comments (1)

Hollow House Institute

This is a strong architecture approach, especially around invariants and typed outputs.

But it still assumes control is enforced before or after execution.

That’s where systems tend to drift in production.

At runtime:

  • Invariant failure → should trigger escalation, not just retry
  • Invalid output shape → should pause downstream execution
  • Repeated failure → should invoke stop authority
  • Cross-step inconsistency → should route to human-in-the-loop

Without that, the system can remain structurally correct while behavior still drifts over time.

That’s the difference between validation and governance.