Hey folks,
At the beginning of 2024, we were working as a services company for enterprise customers who had a very concrete request:
automate incoming emails → contract updates → ERP systems.
The first versions worked.
Then, over time, they quietly stopped working.
And not just because of new edge cases or creative wording.
Emails we had already processed correctly started failing again.
The same supplier messages produced different outputs weeks later.
Minor prompt edits broke unrelated extraction logic.
Model updates changed behavior without any visible signal.
And business rules ended up split across prompts, workflows, and human memory.
In an ERP context, this is unacceptable — you don’t get partial credit for “mostly correct”.
We looked for existing tools that could stabilize AI logic under these conditions. We didn’t find any that handled:
- regression against previously working inputs
- controlled evolution of prompts
- decoupling AI logic from automation workflows
- explainability when something changes
So we did what we knew from software engineering and automation work:
we treated prompts as business logic, and built a continuous development, testing, and deployment framework around them.
That meant:
- versioned prompts
- explicit output schemas
- regression tests against historical inputs
- model upgrades treated as migrations, not surprises
- and releases that were blocked unless everything still worked (sketched below)
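Here is a minimal sketch, in plain Python, of what that can look like. The names, the schema, the `historical_cases.json` file, and the `call_model` hook are hypothetical stand-ins for illustration, not Genum's actual interface:

```python
# Hypothetical sketch of "prompt = code": a versioned prompt release with an
# explicit output schema, plus a regression test over historical inputs.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class PromptRelease:
    name: str
    version: str          # bumped like any other code change
    template: str         # the prompt text, reviewed and versioned in git
    output_schema: dict   # explicit, machine-checkable contract for the output

CONTRACT_UPDATE_V1 = PromptRelease(
    name="contract_update_extractor",
    version="1.4.2",
    template=(
        "Decide whether the email below announces a contract update.\n"
        "Answer as JSON matching the schema exactly.\n\nEMAIL:\n{email}"
    ),
    output_schema={"is_contract_update": bool, "effective_date": str},
)

def validate(output: dict, schema: dict) -> None:
    """Reject anything that drifts from the declared schema: no free text."""
    assert set(output) == set(schema), f"key mismatch: {set(output) ^ set(schema)}"
    for key, expected_type in schema.items():
        assert isinstance(output[key], expected_type), f"{key} is not a {expected_type.__name__}"

def test_regression_against_historical_inputs(call_model):
    """Every email we already handled correctly must still produce the same
    signal. `call_model` is a stand-in (e.g. a pytest fixture) for whatever
    client actually renders the template and calls the model."""
    for case in json.loads(Path("historical_cases.json").read_text()):
        output = call_model(CONTRACT_UPDATE_V1, case["email"])
        validate(output, CONTRACT_UPDATE_V1.output_schema)
        assert output == case["expected"], f"regression on case {case['id']}"
```

The point is that the prompt text, its output contract, and the historical test cases live in version control together, so a prompt edit or model swap that breaks a previously correct email fails the suite instead of reaching production.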
By late 2024, this approach let us reliably extract contract updates from unstructured emails sent by more than 100 suppliers into ERP systems with **100% signal accuracy**.
As of 2025, the product is deployed across multiple enterprises.
We’re sharing it as open source because this problem isn’t unique to us — it’s what happens when LLMs leave experiments and enter real workflows.
You can think of it as Cursor for prompts + GitHub + an execution and integration environment.
The mental model that finally clicked for us wasn’t “prompt engineering”, but prompt = code.
**Patterns that actually mattered for us**
These weren’t theoretical ideas — they came from production failures:
- **Narrow surface decomposition.** One prompt = one signal. No “do everything” prompts. Boolean / scalar outputs instead of free text.
- **Test before production (always).** If behavior isn’t testable, it doesn’t ship. No runtime magic, no self-healing agents.
- **Decouple AI logic from workflows.** Prompts don’t live inside n8n / agents / app code. Workflows call versioned prompt releases.
- **Model changes are migrations, not surprises.** New model → rerun regressions offline → commit or reject (sketched below).

This approach is already running in several enterprise deployments. One example: extracting business signals from incoming emails into ERP systems with 100% signal accuracy at the indicator level (not “pretty text”, but actual machine-actionable flags).
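To make the migration pattern concrete, here is a rough sketch of an offline gate that reruns the regression set against a candidate model and rejects the upgrade if anything that used to pass now fails. The `run_prompt` hook, the model names, and the `historical_cases.json` file are again hypothetical stand-ins, not Genum's actual API:

```python
# Hypothetical sketch of "model changes are migrations": rerun the frozen
# regression set against a candidate model offline, then commit or reject.
import json
from pathlib import Path
from typing import Callable

def evaluate(run_prompt: Callable[[str, str], dict], model: str, cases: list[dict]) -> set[str]:
    """Return the ids of regression cases this model gets wrong."""
    return {c["id"] for c in cases if run_prompt(model, c["email"]) != c["expected"]}

def gate_model_upgrade(run_prompt: Callable[[str, str], dict],
                       current_model: str, candidate_model: str) -> bool:
    """Promote the candidate only if it introduces no new failures on the
    frozen regression set; otherwise keep the currently pinned model."""
    cases = json.loads(Path("historical_cases.json").read_text())
    new_failures = evaluate(run_prompt, candidate_model, cases) - evaluate(run_prompt, current_model, cases)
    if new_failures:
        print(f"reject {candidate_model}: new regressions on {sorted(new_failures)}")
        return False
    print(f"commit {candidate_model}: no new regressions")
    return True
```

The pinned model only changes through an explicit, reviewable step, the same way a schema migration would.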
**What Genum is (and isn’t)**
- Open source (on-prem)
- Free to use (SaaS optional, lifetime free tier)
- Includes a small $5 credit for major model providers so testing isn’t hypothetical
- Not a prompt playground
- Not an agent framework
- Not runtime policy enforcement
It’s infrastructure for making AI behavior boring and reliable.
If you’re:
- shipping LLMs inside real systems
- maintaining business automations
- trying to separate experimental AI from production logic
- tired of prompts behaving like vibes instead of software
we’d genuinely love feedback — especially critical feedback.
Links (if you want to dig in):
- Repo: https://github.com/genumai/
- Docs: https://genum.ai/docs
- Website: https://genum.ai
- YouTube (patterns & deep dives): https://www.youtube.com/@Genum-ai
- We are looking for advisors: https://cdn.genum.ai/docs/advisor_pitch.pdf
We’re not here to sell anything — this exists because we needed it ourselves.
Happy to answer questions, debate assumptions, or collaborate with people who are actually running this stuff in production.
— The Genum team