The first version of our AI product had this in the codebase:
```python
system_prompt = f"""
You are a customer support agent for {company}.
{premium_instructions if tier == 'premium' else free_instructions}
{billing_policy if issue_type in ['billing', 'refund'] else ''}
...12 more conditional blocks...
"""
```
This works until it doesn't. By month three we had:
- 4,000-token prompts being sent unconditionally
- Conditional logic scattered across Python files, config files, and Notion docs
- No way to test a prompt change without deploying the app
- No version history - just vibes and git blame
We looked at every existing tool. They all had one thing in common: they stored prompts as flat strings with variable substitution. That doesn't solve the problem; it just moves the string somewhere else.
## The actual problem: prompts need real conditional logic
The LLM doesn't need to see instructions for premium users when the current user is on the free tier. It doesn't need the billing policy when the question is technical. Every irrelevant section is tokens the model has to attend to and reason around - and we're paying for all of it.
We built Echo PDK - a DSL that evaluates if/else logic server-side before the prompt ever reaches the model.
```
[#ROLE system]
You are a support agent for {{company}}.
[#IF {{tier}} #equals(Premium)]
Priority customer. Offer callback, skip escalation.
[END IF]
[END ROLE]

[#IF {{issue}} #one_of(billing, refund)]
[#INCLUDE billing_policies]
[END IF]

[#ROLE user]
{{question}}
[END ROLE]
```
The render engine evaluates the conditionals, substitutes variables, and returns the resolved messages array. The LLM sees only the output - never the template logic.
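To make the render step concrete, here is a minimal sketch of server-side conditional rendering. This is not the Echo PDK engine (which is a full Chevrotain-based parser); it handles only the `#equals` operator and `{{var}}` substitution, and the regex-based approach is purely illustrative.

```python
import re

def render(template: str, vars: dict) -> str:
    """Resolve [#IF {{var}} #equals(value)] ... [END IF] blocks, then
    substitute remaining {{var}} placeholders. Toy implementation."""
    def resolve_if(match: re.Match) -> str:
        var, expected, body = match.group(1), match.group(2), match.group(3)
        # Keep the body only when the variable matches the expected value.
        return body if str(vars.get(var)) == expected else ""

    if_pattern = re.compile(
        r"\[#IF \{\{(\w+)\}\} #equals\((\w+)\)\](.*?)\[END IF\]", re.DOTALL
    )
    resolved = if_pattern.sub(resolve_if, template)
    # Substitute any remaining {{var}} placeholders with their values.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(vars.get(m.group(1), "")), resolved)
```

The key property is the one described above: the model only ever sees the resolved output, never the template logic or the branches that didn't apply.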
## What we learned building a DSL from scratch
The parser is built on Chevrotain. We spent a month on operator design, balancing readable names (#contains, #greater_than) for non-engineers against short aliases (#gt, #in) for developers. We ship both.
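The dual-naming idea can be sketched as an operator table plus an alias map, so both spellings resolve to one implementation. The exact operator set and evaluation signature here are assumptions, not Echo PDK's internals:

```python
# Operator implementations, keyed by their readable names.
OPERATORS = {
    "contains": lambda value, arg: arg in value,
    "greater_than": lambda value, arg: float(value) > float(arg),
    "one_of": lambda value, *args: value in args,
}

# Short developer aliases resolve to the same implementations.
ALIASES = {"gt": "greater_than", "in": "one_of"}

def evaluate(op: str, value, *args) -> bool:
    """Evaluate an operator by either its readable name or its alias."""
    op = ALIASES.get(op, op)
    return OPERATORS[op](value, *args)
```

Keeping aliases as a lookup layer (rather than duplicating entries) means every operator has exactly one implementation to test.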
The hardest design decision was the #ai_gate operator - an LLM-evaluated condition. Useful for "if the user's message is angry, include de-escalation instructions" but adds latency and cost. We ship it but recommend using it sparingly.
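The idea behind an LLM-evaluated condition can be sketched as a small classifier call. The prompt wording and the `llm_call` callable are illustrative assumptions, not the actual #ai_gate implementation:

```python
def ai_gate(condition: str, text: str, llm_call) -> bool:
    """Ask an LLM whether `text` satisfies `condition`; returns a boolean.
    `llm_call` is any function mapping a prompt string to a completion string."""
    prompt = (
        f"Answer only YES or NO. Does the following message satisfy "
        f"this condition: '{condition}'?\n\nMessage: {text}"
    )
    # Each gate evaluation is an extra model round-trip: added latency
    # and cost, which is why such gates should be used sparingly.
    return llm_call(prompt).strip().upper().startswith("YES")
```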
The meta template feature - where model selection and temperature can be conditional - was an accident. We were refactoring and realized there was no reason model config had to be hardcoded in application code. Now the prompt decides: creative tasks get gpt-4o at 0.9, extraction gets gpt-4o-mini at 0.2.
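In application code, the effect of a meta template is that model config comes out of the render step instead of being hardcoded. A sketch of the resolved behavior, with the task names and the default branch as assumptions (the post only specifies the creative and extraction cases):

```python
def resolve_model_config(task_type: str) -> dict:
    """Return model config conditionally, the way a meta template would."""
    if task_type == "creative":
        return {"model": "gpt-4o", "temperature": 0.9}
    if task_type == "extraction":
        return {"model": "gpt-4o-mini", "temperature": 0.2}
    # Fallback values are assumed for illustration.
    return {"model": "gpt-4o-mini", "temperature": 0.7}
```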
## Token reduction in practice
In production, conditional rendering dropped our average input tokens from ~4,100 to ~1,200 across all request types - roughly a 70% reduction. The quality improvement was unexpected but real: smaller models especially seem to benefit from receiving only relevant context.
## Open source
Echo PDK is MIT licensed: github.com/GoReal-AI/echo-pdk. The plugin system lets you add custom operators - #isValidEmail, #isEmpty, domain-specific validators. We built a hosted layer (EchoStash) on top for version control and evals, but the DSL works fully standalone.
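A custom-operator plugin might look something like the following. This is a hypothetical registration API, not Echo PDK's actual plugin interface (which is TypeScript; check the repo for the real shape):

```python
import re

# Registry of custom operators added by plugins (illustrative).
CUSTOM_OPERATORS: dict = {}

def register_operator(name: str):
    """Decorator that registers a predicate under an operator name."""
    def decorator(fn):
        CUSTOM_OPERATORS[name] = fn
        return fn
    return decorator

@register_operator("isValidEmail")
def is_valid_email(value: str) -> bool:
    # Deliberately simple pattern: something@something.tld, no spaces.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None

@register_operator("isEmpty")
def is_empty(value) -> bool:
    return value is None or str(value).strip() == ""
```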
What's your approach to prompt management in production? Curious whether others have hit the same walls.