Jon Mandraki
We tried every multi-agent framework — then built our own runtime

Why we gave up on framework-as-library and built Orloj, an open-source orchestration runtime for multi-agent AI systems, with governance enforced at runtime and declarative YAML manifests inspired by Kubernetes.

The wall every framework hit

Last year my co-founder Kristiane and I tried to build a production multi-agent system. The use case was boring: a few agents that research, write, review, and publish. The part that wasn't boring was trying to actually run it.

We tried most of the popular frameworks. Every single one broke down in the same places:

  • Governance was "trust the prompt". Tool permissions lived inside system prompts. One clever jailbreak and the whole thing was wide open.
  • State was somebody else's problem. Worker crashes left orphan tasks. Restarts double-ran jobs. There was no durable task queue, no concept of ownership.
  • Everything was glued together in Python. Want a non-Python team to edit a workflow? Tough. Want to swap a tool implementation without touching the agent graph? Tough.
  • MCP and CLI tools were an afterthought. We use a lot of MCP servers, and managing them and bolting them into every framework felt like a part-time job.

None of this is a knock on those frameworks. They're great for prototyping and small projects. But we wanted something we could hand to a platform team and say "run this in prod." It didn't exist, so we built it.

Meet Orloj

Orloj is an open-source orchestration runtime for multi-agent AI systems. You declare your agents, tools, policies, and workflows in YAML manifests, and Orloj handles scheduling, execution, governance, and reliability. We call it Agent-Infrastructure-as-Code.

If you've used Kubernetes or Terraform, the pattern is going to feel familiar, because that's exactly what inspired us. I've been writing Kubernetes manifests and Terraform modules for years, and the declarative, resource-based model just works for distributed systems. Turns out agent systems are distributed systems with extra steps.

Here's a minimal agent manifest:

apiVersion: orloj.dev/v1
kind: Agent
metadata:
  name: telegram-bot-writer-agent
spec:
  model_ref: anthropic-default
  prompt: |
    You are the writer and delivery stage of a Telegram content bot
  tools:
    - telegram-send-message
  limits:
    max_steps: 4
    timeout: 30s

And here's what ties a system together:

apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: telegram-bot-system
  labels:
    orloj.dev/pattern: pipeline
spec:
  agents:
    - telegram-bot-parser-agent
    - telegram-bot-planner-agent
    - telegram-bot-research-agent
    - telegram-bot-writer-agent
  graph:
    telegram-bot-parser-agent:
      edges:
        - to: telegram-bot-planner-agent
    telegram-bot-planner-agent:
      edges:
        - to: telegram-bot-research-agent
    telegram-bot-research-agent:
      edges:
        - to: telegram-bot-writer-agent

That's it. Run orlojctl apply -f ./my-system and the scheduler picks it up.

Four things we think we got right

1. Governance is a runtime gate, not a prompt instruction
This was the biggest unlock for us. Instead of trusting the model to follow instructions like "don't call the delete API", we treat policies as first-class resources:

apiVersion: orloj.dev/v1
kind: AgentPolicy
metadata:
  name: cost-policy
spec:
  apply_mode: scoped
  target_systems:
    - report-system
  max_tokens_per_run: 50000
  allowed_models:
    - gpt-4o
  blocked_tools:
    - filesystem_delete

Every agent turn and every tool call hits the policy engine before it runs. Unauthorized actions fail closed with a structured error and a full audit trail. If the model tries to call a blocked tool, it doesn't get a polite refusal; it gets a hard stop and a trace entry.

Anyone who's ever had to explain to a security team that "we ask the AI nicely not to exfiltrate data" knows the feeling.
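To make the fail-closed idea concrete, here's a minimal sketch of what a runtime policy gate can look like. This is illustrative, not Orloj's actual implementation; the names `AgentPolicy`, `PolicyViolation`, and `check_tool_call` are made up for the example, and the fields mirror the manifest above:

```python
# Conceptual sketch of a fail-closed policy gate. Not Orloj's real code:
# AgentPolicy, PolicyViolation, and check_tool_call are hypothetical names.
from dataclasses import dataclass, field


@dataclass
class AgentPolicy:
    blocked_tools: set = field(default_factory=set)
    max_tokens_per_run: int = 0  # 0 means "no limit" in this sketch


class PolicyViolation(Exception):
    """Raised BEFORE the action executes; also written to the trace."""


def check_tool_call(policy: AgentPolicy, tool: str, tokens_used: int) -> None:
    # Fail closed: the first matching rule stops the call outright.
    if tool in policy.blocked_tools:
        raise PolicyViolation(f"tool '{tool}' is blocked by policy")
    if policy.max_tokens_per_run and tokens_used >= policy.max_tokens_per_run:
        raise PolicyViolation("token budget for this run is exhausted")


policy = AgentPolicy(blocked_tools={"filesystem_delete"}, max_tokens_per_run=50000)
check_tool_call(policy, "telegram-send-message", tokens_used=1200)  # passes

try:
    check_tool_call(policy, "filesystem_delete", tokens_used=1200)
except PolicyViolation as e:
    print(e)  # structured error instead of a polite refusal
```

The point is that the check runs in the orchestrator, outside the model's reach, so no amount of prompt injection can talk its way past it.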

2. Lease-based task ownership, borrowed from distributed systems
Workers claim tasks with a lease. If a worker crashes mid-task, the lease expires and another worker picks it up. No orphans, no double-runs. It's the same pattern Kubernetes uses for controller leader election.

The practical effect: you can run orlojworker on whatever compute you want, including heterogeneous fleets. We run some workers on CPU boxes for cheap tool calls and some on GPU boxes for local model inference. The scheduler doesn't care.

3. Complete audit trails
Orloj keeps a complete audit trail via traces, so you can understand and debug how your system behaves at every step: how many input and output tokens each agent used, which tools were called and when, and the latency of each step.
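As a rough mental model, each trace is a sequence of per-step records you can aggregate. The `TraceEntry` shape and `tokens_per_agent` helper below are hypothetical, not Orloj's actual trace schema; the fields just mirror what the post says is recorded:

```python
# Hypothetical trace-entry shape (not Orloj's real schema): one record per
# agent step, with token counts, the tool called (if any), and latency.
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEntry:
    agent: str
    tool: Optional[str]
    input_tokens: int
    output_tokens: int
    latency_ms: float


def tokens_per_agent(trace: list[TraceEntry]) -> dict[str, int]:
    """Sum input + output tokens by agent, e.g. for cost attribution."""
    totals: dict[str, int] = {}
    for entry in trace:
        totals[entry.agent] = (
            totals.get(entry.agent, 0) + entry.input_tokens + entry.output_tokens
        )
    return totals


trace = [
    TraceEntry("parser", None, 300, 50, 120.0),
    TraceEntry("writer", "telegram-send-message", 800, 400, 950.0),
    TraceEntry("writer", None, 200, 150, 610.0),
]
print(tokens_per_agent(trace))  # {'parser': 350, 'writer': 1550}
```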

4. MCP servers and CLIs are first-class resources

Register an MCP server (container, stdio, or HTTP) and Orloj auto-discovers its tools. They become resources you can reference in manifests, attach policies to, and audit like anything else. No adapter layer, no wrapper code. Containerized tools just make sense: why run an MCP server 24/7 when you may only need it a few times a day? Your MCP server spins up when needed and shuts down when idle.

apiVersion: orloj.dev/v1
kind: McpServer
metadata:
  name: gmail
spec:
  transport: stdio
  image: mcp/gmail
  idle_timeout: "5m"
  env:
    - name: GMAIL_OAUTH_PATH
      secretRef: gmail-creds/oauth_keys
      mountPath: /secrets/gcp-oauth.keys.json
    - name: GMAIL_CREDENTIALS_PATH
      secretRef: gmail-creds/credentials
      mountPath: /secrets/credentials.json

After that, every tool the MCP server exposes is usable by agents — and governable by policies.
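The discovery step leans on the MCP spec itself, which defines a JSON-RPC `tools/list` method returning each tool's name and schema. Here's a sketch of turning such a response into referenceable tool names; the response payload is a made-up example, not real output from the gmail server:

```python
# Parsing an MCP "tools/list" JSON-RPC response into referenceable tool names.
# The tools/list method is part of the real MCP spec; this payload is invented.
import json

raw = """
{"jsonrpc": "2.0", "id": 1, "result": {"tools": [
  {"name": "send_email", "description": "Send an email via Gmail"},
  {"name": "search_threads", "description": "Search mail threads"}
]}}
"""

response = json.loads(raw)
discovered = [tool["name"] for tool in response["result"]["tools"]]
print(discovered)  # ['send_email', 'search_threads']
```

Because discovery happens at the runtime level, each discovered name is also a policy target: you can block `send_email` in one system and allow it in another without touching any agent code.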

What we're still figuring out

Orloj is still in beta, so here are the honest tradeoffs today:

  • The ecosystem is tiny compared to LangGraph or CrewAI. We have examples, templates, and Python/JS SDKs, but no massive community contributing yet.
  • We don't have a hosted platform. You run orlojd and orlojworker yourself. Some teams love this; others want a managed option, which is on the roadmap.
  • The declarative model is great for stable workflows and less great for highly dynamic agent graphs that rewrite themselves at runtime. We're working on it.

We'd rather ship these honestly than pretend they're not tradeoffs.

Try it

Everything is Apache 2.0 and on GitHub. The quickstart is a few commands:
Install the CLI:

brew tap OrlojHQ/orloj
brew install orlojctl

Install the server and workers:

curl -sSfL https://raw.githubusercontent.com/OrlojHQ/orloj/main/scripts/install.sh | ORLOJ_BINARIES="orlojd" sh

curl -sSfL https://raw.githubusercontent.com/OrlojHQ/orloj/main/scripts/install.sh | ORLOJ_BINARIES="orlojworker" sh

Once it's running, open localhost:8080 to see the web UI, where you can also manage your systems.

If you've hit the same walls we did, I'd love to hear what you're building. What would you use this for? What's missing? Drop a comment or open an issue.
