Hey folks,
At the beginning of 2024, we were working as a services company for enterprise customers who had a very concrete request:
automate incoming emails → contract updates → ERP systems.
The first versions worked.
Then, over time, they quietly stopped working.
And not just because of new edge cases or creative wording.
Emails we had already processed correctly started failing again.
The same supplier messages produced different outputs weeks later.
Minor prompt edits broke unrelated extraction logic.
Model updates changed behavior without any visible signal.
And business rules ended up split across prompts, workflows, and human memory.
In an ERP context, this is unacceptable — you don’t get partial credit for “mostly correct”.
We looked for existing tools that could stabilize AI logic under these conditions. We didn’t find any that handled:
- regression against previously working inputs
- controlled evolution of prompts
- decoupling AI logic from automation workflows
- explainability when something changes
So we did what we knew from software engineering and automation work:
we treated prompts as business logic, and built a continuous development, testing, and deployment framework around them.
That meant:
- versioned prompts
- explicit output schemas
- regression tests against historical inputs
- model upgrades treated as migrations, not surprises
- and releases that were blocked unless everything still worked (sketched below)
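Here is a minimal sketch, in plain Python, of what that can look like. The names, the schema, the `historical_cases.json` file, and the `call_model` hook are hypothetical stand-ins for illustration, not Genum's actual interface:

```python
# Hypothetical sketch of "prompt = code": a versioned prompt release with an
# explicit output schema, plus a regression test over historical inputs.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class PromptRelease:
    name: str
    version: str          # bumped like any other code change
    template: str         # the prompt text, reviewed and versioned in git
    output_schema: dict   # explicit, machine-checkable contract for the output

CONTRACT_UPDATE_V1 = PromptRelease(
    name="contract_update_extractor",
    version="1.4.2",
    template=(
        "Decide whether the email below announces a contract update.\n"
        "Answer as JSON matching the schema exactly.\n\nEMAIL:\n{email}"
    ),
    output_schema={"is_contract_update": bool, "effective_date": str},
)

def validate(output: dict, schema: dict) -> None:
    """Reject anything that drifts from the declared schema: no free text."""
    assert set(output) == set(schema), f"key mismatch: {set(output) ^ set(schema)}"
    for key, expected_type in schema.items():
        assert isinstance(output[key], expected_type), f"{key} is not a {expected_type.__name__}"

def test_regression_against_historical_inputs(call_model):
    """Every email we already handled correctly must still produce the same
    signal. `call_model` is a stand-in (e.g. a pytest fixture) for whatever
    client actually renders the template and calls the model."""
    for case in json.loads(Path("historical_cases.json").read_text()):
        output = call_model(CONTRACT_UPDATE_V1, case["email"])
        validate(output, CONTRACT_UPDATE_V1.output_schema)
        assert output == case["expected"], f"regression on case {case['id']}"
```

The point is that the prompt text, its output contract, and the historical test cases live in version control together, so a prompt edit or model swap that breaks a previously correct email fails the suite instead of reaching production.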
By late 2024, this approach let us reliably extract contract updates from unstructured emails sent by more than 100 suppliers into ERP systems with **100% signal accuracy**.
As of 2025, the product is deployed across multiple enterprises.
We’re sharing it as open source because this problem isn’t unique to us — it’s what happens when LLMs leave experiments and enter real workflows.
You can think of it as Cursor for prompts + GitHub + an execution and integration environment.
The mental model that finally clicked for us wasn’t “prompt engineering”, but prompt = code.
**Patterns that actually mattered for us**
These weren’t theoretical ideas — they came from production failures:
- **Narrow surface decomposition.** One prompt = one signal. No “do everything” prompts. Boolean / scalar outputs instead of free text.
- **Test before production (always).** If behavior isn’t testable, it doesn’t ship. No runtime magic, no self-healing agents.
- **Decouple AI logic from workflows.** Prompts don’t live inside n8n / agents / app code. Workflows call versioned prompt releases.
- **Model changes are migrations, not surprises.** New model → rerun regressions offline → commit or reject (sketched below).

This approach is already running in several enterprise deployments. One example: extracting business signals from incoming emails into ERP systems with 100% signal accuracy at the indicator level (not “pretty text”, but actual machine-actionable flags).
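To make the migration pattern concrete, here is a rough sketch of an offline gate that reruns the regression set against a candidate model and rejects the upgrade if anything that used to pass now fails. The `run_prompt` hook, the model names, and the `historical_cases.json` file are again hypothetical stand-ins, not Genum's actual API:

```python
# Hypothetical sketch of "model changes are migrations": rerun the frozen
# regression set against a candidate model offline, then commit or reject.
import json
from pathlib import Path
from typing import Callable

def evaluate(run_prompt: Callable[[str, str], dict], model: str, cases: list[dict]) -> set[str]:
    """Return the ids of regression cases this model gets wrong."""
    return {c["id"] for c in cases if run_prompt(model, c["email"]) != c["expected"]}

def gate_model_upgrade(run_prompt: Callable[[str, str], dict],
                       current_model: str, candidate_model: str) -> bool:
    """Promote the candidate only if it introduces no new failures on the
    frozen regression set; otherwise keep the currently pinned model."""
    cases = json.loads(Path("historical_cases.json").read_text())
    new_failures = evaluate(run_prompt, candidate_model, cases) - evaluate(run_prompt, current_model, cases)
    if new_failures:
        print(f"reject {candidate_model}: new regressions on {sorted(new_failures)}")
        return False
    print(f"commit {candidate_model}: no new regressions")
    return True
```

The pinned model only changes through an explicit, reviewable step, the same way a schema migration would.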
**What Genum is (and isn’t)**
- Open source (on-prem)
- Free to use (SaaS optional, lifetime free tier)
- Includes a small $5 credit for major model providers so testing isn’t hypothetical
- Not a prompt playground
- Not an agent framework
- Not runtime policy enforcement
It’s infrastructure for making AI behavior boring and reliable.
If you’re:
- shipping LLMs inside real systems
- maintaining business automations
- trying to separate experimental AI from production logic
- tired of prompts behaving like vibes instead of software
we’d genuinely love feedback — especially critical feedback.
Links (if you want to dig in):
- Repo: https://github.com/genumai/
- Docs: https://genum.ai/docs
- Website: https://genum.ai
- YouTube (patterns & deep dives): https://www.youtube.com/@Genum-ai
- We are looking for advisors: https://cdn.genum.ai/docs/advisor_pitch.pdf
We’re not here to sell anything — this exists because we needed it ourselves.
Happy to answer questions, debate assumptions, or collaborate with people who are actually running this stuff in production.
— The Genum team