Building HELM: from "AI agents are powerful" to "AI agents need an execution boundary"

Ivan Peychev — Mon, 01 Jun 2026 19:27:13 +0000

Hey DEV, first post here. My name is Ivan. I'm a founder based in Toronto, Canada, building open-source AI infrastructure under Mindburn Labs. I spend most of my time designing autonomous agent systems and thinking about what it actually means to deploy them safely. Glad to be here.

I'm building HELM, an execution authority layer for AI agents.

The simplest version of the idea is this:

Models propose. HELM governs execution. Every decision leaves proof.

That sentence took a long time to arrive at.

For a while, I thought the big opportunity in AI agents was better orchestration: better workflows, better chains, better planning, better memory, better autonomous loops. But the deeper I went, the more obvious the real bottleneck became.

The hard problem is not just getting an agent to decide what to do.

The hard problem is deciding whether it should be allowed to do it.

Because the moment an AI agent can touch a real tool, the risk changes completely. It is no longer just "did the model hallucinate?" It becomes:

Can it delete a database?
Can it send a customer email?
Can it export private data?
Can it issue a refund?
Can it modify cloud permissions?
Can it deploy code?
Can it trigger a workflow that a human or business system will treat as real?

That is the boundary I became obsessed with.

The original itch

I started from a frustration that I think a lot of builders are running into now.

AI agents are becoming easier to build. Every week there is a new framework, SDK, workflow engine, tool-calling protocol, hosted agent builder, or "AI employee" product.

But when you actually try to put agents near production systems, the same uncomfortable question appears:

Who is responsible for the side effect?

If the agent proposes something, that is one thing. If the agent executes something, that is another.

Most agent stacks blur those two things. The model reasons, plans, selects a tool, calls the tool, and the surrounding application tries to clean up the mess with logs, prompts, evals, dashboards, or human review.

That felt backwards.

I kept coming back to a simple systems principle:

Stochastic systems can propose. Deterministic systems should authorize execution.

That became the foundation for HELM.

What HELM is

HELM is not another agent framework.

It is not trying to replace LangChain, CrewAI, OpenAI Agents, Claude tool use, MCP servers, workflow builders, or internal automation.

HELM sits underneath those systems.

The first product is HELM AI Kernel, a local-first, OSS execution boundary for AI agents. It is designed to intercept tool calls and OpenAI-compatible requests before they become side effects, evaluate whether the action is allowed, and emit signed receipts and EvidencePacks that can be verified later.

In practical terms:

An agent proposes an action.
HELM evaluates the action.
HELM returns ALLOW, DENY, or ESCALATE.
The decision is recorded.
The proof can be reviewed later.
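
To make that flow concrete, here is a minimal sketch in Python. It is not HELM's actual API: the `ProposedAction` type, the `POLICY` table, the tool names, and the `evaluate` function are all hypothetical, chosen only to show a deterministic check returning one of the three verdicts, with anything unlisted denied by default.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ALLOW = "ALLOW"
    DENY = "DENY"
    ESCALATE = "ESCALATE"


@dataclass(frozen=True)
class ProposedAction:
    tool: str   # hypothetical tool name, e.g. "db.drop_table"
    agent: str  # hypothetical agent identifier


# Hypothetical policy table (an assumption for this sketch): tool name -> verdict.
POLICY = {
    "crm.read_contact": Verdict.ALLOW,
    "payments.issue_refund": Verdict.ESCALATE,
    "db.drop_table": Verdict.DENY,
}


def evaluate(action: ProposedAction) -> Verdict:
    # Deterministic lookup; any tool not explicitly listed is denied.
    return POLICY.get(action.tool, Verdict.DENY)


# The agent proposes, the boundary decides.
for tool in ("crm.read_contact", "payments.issue_refund", "unknown.tool"):
    action = ProposedAction(tool=tool, agent="support-bot")
    print(tool, "->", evaluate(action).value)
```

The point of the sketch is the shape: the model never executes anything itself, and the verdict comes from code that gives the same answer for the same input every time.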

The second product path is HELM AI Enterprise Basic, which turns the local boundary into team-scale control: shared workspaces, approvals, governed actions, receipts, policies, API access, and short-retention evidence.

The larger commercial product is HELM AI Enterprise, which I think of as a Company AI OS: it makes company state queryable, detects should-vs-is drift, generates executable specs, routes approved work through the HELM boundary, and writes closure evidence back into the proof system.

The insight that changed the product

The early temptation was to describe HELM as "AI governance."

That was a mistake.

Developers do not wake up wanting governance. Engineering teams do not want a dashboard that tells them, after the fact, that an agent did something dangerous.

They want agents to move faster without giving them unchecked authority.

So the framing became sharper:

HELM AI Kernel is a fail-closed execution firewall for AI agents.

A firewall is understandable. Fail-closed is understandable. Execution is concrete. Receipts are concrete. Verification is concrete.
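
For readers who want "fail-closed" spelled out, here is a rough Python sketch of the idea, not HELM's implementation. The names `guarded_call`, `ExecutionDenied`, `send_email`, and `broken_policy` are hypothetical. The assumption it illustrates: if the policy check itself errors out, or returns anything other than an explicit yes, the tool is never called.

```python
from typing import Any, Callable


class ExecutionDenied(Exception):
    """Raised when the gate refuses to run a tool call."""


def guarded_call(
    tool: Callable[..., Any],
    is_allowed: Callable[[str, dict], bool],
    tool_name: str,
    args: dict,
) -> Any:
    try:
        allowed = is_allowed(tool_name, args)
    except Exception as exc:
        # Fail closed: an error inside the policy check counts as a denial.
        raise ExecutionDenied(f"policy check failed for {tool_name}") from exc
    if allowed is not True:
        # Only an explicit True lets the side effect through.
        raise ExecutionDenied(f"{tool_name} is not allowed")
    return tool(**args)


calls = []  # records side effects so we can see whether the tool ran


def send_email(to: str) -> str:
    calls.append(to)
    return f"sent to {to}"


def broken_policy(tool_name: str, args: dict) -> bool:
    # Simulates the policy backend being unavailable.
    raise RuntimeError("policy store unreachable")


try:
    guarded_call(send_email, broken_policy, "email.send", {"to": "customer@example.com"})
except ExecutionDenied as err:
    print("denied:", err)
```

A fail-open design would do the opposite and let the email go out when the check breaks, which is exactly the behavior a firewall framing rules out.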

The goal is not "trust AI."

The goal is: Do not trust your agents. Verify their execution.

Why this matters now

The AI market is moving from generation to execution.

The first wave was text, code, chat, summarization, and copilots.

The next wave is agents that can actually do things: open PRs, change infrastructure, call APIs, move data, trigger business workflows, send messages, update CRMs, submit payments, and coordinate across tools.

That shift creates a new infrastructure problem.

When software becomes agentic, logs are not enough. Observability is not enough. Prompt rules are not enough. Human approval for every action does not scale.

You need an execution boundary. You need policy before the action, not only analysis after the action. You need proof that survives disputes, audits, incidents, and customer questions.

What I'm deliberately not building

HELM is not an agent orchestration platform. Orchestration decides what an agent should attempt. HELM decides what may execute.
HELM is not a generic AI governance dashboard. Dashboards show things. HELM governs side effects.
HELM is not an AGI operating system. That framing is vague and too far ahead of product reality.
HELM is not Helm, the Kubernetes package manager. This is HELM AI Kernel by Mindburn Labs.

What has been hardest

The hardest part has not been writing a policy engine or drawing architecture diagrams.

The hardest part has been resisting category drift.

AI makes it very easy to sound bigger than you are. The market does not need another grand AI claim. It needs a clean answer to a concrete problem:

Can I let this agent touch real systems without giving it unchecked power?

That is the product I want to build.

What I'm learning

Developers respond better to security mechanisms than governance language. "Execution firewall" lands. "AI governance platform" does not.
Founders want velocity, not bureaucracy. The product has to feel like it helps ship faster.
Proof is a product surface. Receipts, EvidencePacks, offline verification, and replay are not backend details. They are how customers build confidence.
OSS cannot be a crippled teaser. HELM AI Kernel has to be useful on its own.
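
As an illustration of why receipts and offline verification are a product surface, here is a small Python sketch of a tamper-evident receipt. It does not reflect HELM's real receipt or EvidencePack format, which this post does not specify. The field names, the `DEMO_KEY`, and the use of HMAC-SHA256 are all assumptions made for the example; a real system would more likely use asymmetric signatures so that verifiers never hold the signing key.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret, for demonstration only.
DEMO_KEY = b"demo-key-not-for-production"


def _canonical(payload: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding to sign.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_receipt(payload: dict, key: bytes = DEMO_KEY) -> dict:
    tag = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": tag}


def verify_receipt(receipt: dict, key: bytes = DEMO_KEY) -> bool:
    expected = hmac.new(key, _canonical(receipt["payload"]), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    return hmac.compare_digest(expected, receipt["signature"])


# Hypothetical decision record.
decision = {"agent": "support-bot", "tool": "payments.issue_refund", "verdict": "ESCALATE"}
receipt = sign_receipt(decision)
print("original verifies:", verify_receipt(receipt))

# Changing the recorded verdict after the fact invalidates the signature.
tampered = {"payload": {**decision, "verdict": "ALLOW"}, "signature": receipt["signature"]}
print("tampered verifies:", verify_receipt(tampered))
```

Verification here needs only the receipt and the key, with no call back to the system that issued it, which is the property "offline verification" refers to.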

What I'm looking for

Builders who are putting agents near real tools and feeling the discomfort. Especially if you are working on:

AI agents that call internal APIs
MCP tool servers
AI coding agents with repo or CI access
Agentic workflows around customer data
Teams trying to move from prototype agents to production agents

The best conversations start with: "We want agents to do X, but we are scared they might do Y."

That is exactly the boundary HELM is being built for.

Repo: https://github.com/Mindburn-Labs/helm-ai-kernel

Happy to answer questions or just hear what problems you're running into.
