
Nguyen Thien

Posted on • Originally published at beevr.ai

We open-sourced Kite, our agent framework. Here is what building production agents taught us.

Everyone has an agent demo in 2026. Far fewer have agents they would put in front of a paying customer, an auditor, or a patient. The gap between "it worked in the notebook" and "it works every time, safely, and we can explain what it did" is where most agent projects quietly die, and it is the gap we built Kite to close.

We just open-sourced it: https://github.com/beevr-labs/Kite. It is Python, MIT licensed, and a pip install kite-agent away. This is the honest writeup of why it exists and what we learned.

The problem Kite solves

We build production software for regulated industries, so we kept hitting the same wall: the popular agent frameworks are great for a prototype and painful for production. Getting to a first working agent in LangChain or AutoGen is a configuration project, and once you are there you still have to bolt on the parts that actually matter in production: guardrails, retries, idempotency, observability, evaluation. We were rebuilding that same scaffolding for every client. Kite is the framework we wish we had started with: opinionated about safety, fast to a running agent, and small enough to read.

The one design decision everything hangs on: treat the LLM as untrusted

This is the core idea. In Kite, the model proposes actions; it does not execute them. A controlled kernel sits between the agent and the real world and validates every proposed action against policy before anything runs. So when an agent decides to call agent.run("rm -rf /"), the kernel refuses it instead of your filesystem finding out the hard way.

It sounds simple. It changes everything about how comfortable you are giving an agent real tools. The model becomes a planner you can sandbox, not a process with your credentials. For anyone running agents on sensitive data or real infrastructure, that boundary is the difference between a demo and something you can actually deploy.
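
To make the boundary concrete, here is a minimal sketch of the propose-then-validate idea in plain Python. It does not use Kite's API: ProposedAction, PolicyKernel, and the deny patterns are hypothetical names chosen for illustration, and a real policy would be far richer than a short regex deny list.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProposedAction:
    """An action the model wants to take. It is data, not an execution."""
    tool: str
    argument: str


@dataclass
class PolicyKernel:
    """Hypothetical gatekeeper (not Kite's kernel): a tool runs only if it
    is registered and its argument matches none of the deny patterns."""
    tools: dict  # tool name -> callable taking one string
    deny_patterns: list = field(
        default_factory=lambda: [r"\brm\s+-rf\b", r"\bmkfs\b"]
    )

    def execute(self, action: ProposedAction) -> str:
        handler = self.tools.get(action.tool)
        if handler is None:
            return f"REFUSED: unknown tool {action.tool!r}"
        for pattern in self.deny_patterns:
            if re.search(pattern, action.argument):
                return f"REFUSED: argument matches deny pattern {pattern!r}"
        # Only reached when every policy check has passed.
        return handler(action.argument)


# Illustrative tool that only echoes; nothing here touches a real shell.
kernel = PolicyKernel(tools={"shell": lambda cmd: f"ran: {cmd}"})

blocked = kernel.execute(ProposedAction("shell", "rm -rf /"))
allowed = kernel.execute(ProposedAction("shell", "ls -la"))
unknown = kernel.execute(ProposedAction("email", "hello"))
```

The point of the shape is that the model's output never reaches a handler directly. Everything goes through execute, so the policy is the single place that decides yes or no.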

What you get out of the box

  • Five reasoning patterns, selectable per agent: ReAct (think, act, observe), ReWOO (plan upfront and run steps in parallel, which Kite clocks at roughly 2x faster), Tree of Thoughts (explore multiple paths), Plan-Execute (decompose and replan on failure), and Reflective (generate, critique, improve).
  • Production safety primitives: a circuit breaker that stops cascading failures, a kill switch (per-agent or global) for when you need everything to stop now, and idempotency keyed on operation IDs so a retried action does not charge a customer twice.
  • Retrieval that is not a toy: HyDE, hybrid BM25 plus vector search, MMR deduplication, and reranking.
  • Prompt A/B testing with statistical confidence intervals on real traffic, because "the new prompt feels better" is not a deployment criterion.
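
For readers who have not met the circuit breaker pattern, the sketch below shows the general technique in plain Python. It is an assumption-laden illustration, not Kite's implementation: CircuitBreaker, CircuitOpenError, and the parameters max_failures and reset_after are hypothetical names.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and calls are being rejected."""


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical, not Kite's class)."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success closes the circuit fully
        return result


def flaky_tool():
    raise TimeoutError("upstream did not answer")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(4):
    try:
        breaker.call(flaky_tool)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("failed")
```

After two real failures the breaker opens, and the remaining calls are rejected without touching the failing tool. That is what keeps one broken dependency from dragging every agent step down with it.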

What it looks like

The fastest path is the generator. Describe the agent, get a runnable file:

pip install kite-agent
export GROQ_API_KEY=your_key
kite generate "research assistant that searches and summarizes" --out agent.py
python agent.py

Or build one directly in Python and pick the reasoning pattern:

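import asyncio
from kite import Kite

async def main():
    # await is only valid inside a coroutine, so wrap the call
    ai = Kite()
    agent = ai.create_agent(name="Bot", agent_type="react")
    return await agent.run("user request")

result = asyncio.run(main())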

Our own benchmarks put time to first agent at under a minute (versus roughly 30 minutes for LangChain and 20 for AutoGen in the same tests) and cold startup around 50ms (versus ~2s and ~1s). Take the comparison as our figures, not an independent audit, but the design intent is clear: get to a safe, running agent fast.

What we learned running agents in production

  • The model is about 10% of the work. The other 90% is tools, retries, guardrails, idempotency, and evaluation. A better model does not save you from a missing kill switch.
  • Most "agent failures" are IO failures in disguise. A flaky tool, a duplicated side effect, a partial write. Observability and idempotency beat another round of prompt tuning almost every time.
  • The untrusted-component framing is freeing, not limiting. Once the kernel is the thing that says yes or no, you stop being afraid to hand the agent real capabilities.
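
The duplicated side effect is the easiest of these failures to show in code. Below is a minimal sketch of idempotency keyed on an operation ID, written in plain Python rather than against Kite's API. IdempotentExecutor and charge are hypothetical names, and the in-memory dict stands in for what would need to be a durable, shared store with an atomic check-and-set in production.

```python
class IdempotentExecutor:
    """Hypothetical helper: remembers results by operation ID so a retry
    of a completed operation returns the stored result instead of
    repeating the side effect."""

    def __init__(self):
        self._results = {}  # assumption: in-memory only, for illustration

    def run(self, operation_id, fn, *args):
        if operation_id in self._results:
            return self._results[operation_id]
        result = fn(*args)  # recorded only if fn returns normally
        self._results[operation_id] = result
        return result


charges = []  # stands in for a payment provider's ledger


def charge(customer, cents):
    charges.append((customer, cents))
    return f"charged {customer} {cents}"


executor = IdempotentExecutor()
first = executor.run("op-123", charge, "alice", 500)
# The caller never saw the response (say, a timeout) and retries
# with the same operation ID.
retry = executor.run("op-123", charge, "alice", 500)
# A different operation ID is a genuinely new charge.
other = executor.run("op-124", charge, "alice", 500)
```

The retry returns the stored result and the ledger gains no second entry for op-123, which is the property that makes blind retries safe.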

Why we open-sourced it

In a field full of black boxes, "you can read the code" is a differentiator, not a giveaway. We build production AI for regulated industries, and the way we earn a technical buyer's trust is by letting them inspect the hardest parts of our stack instead of taking a pitch on faith.

Kite is MIT licensed and lives at https://github.com/beevr-labs/Kite. Issues and PRs welcome. If you are building production-grade or compliance-bound AI and want a partner who ships the boring 90%, here is how we work.

What are you using to build agents in production, and what keeps breaking? Curious where Kite would and would not help.

Originally published on beevr.ai.
