Why we gave up on framework-as-library and built Orloj, an open-source orchestration runtime for multi-agent AI systems, with governance enforced at runtime and declarative YAML manifests inspired by Kubernetes.
The wall every framework hit
Last year my co-founder Kristiane and I tried to build a production multi-agent system. The use case was boring: a few agents that research, write, review, and publish. The part that wasn't boring was trying to actually run it.
We tried most of the popular frameworks. Every single one broke down in the same places:
- Governance was "trust the prompt". Tool permissions lived inside system prompts. One clever jailbreak and the whole thing was wide open.
- State was somebody else's problem. Worker crashes left orphan tasks. Restarts double-ran jobs. There was no durable task queue, no concept of ownership.
- Everything was glued together in Python. Want a non-Python team to edit a workflow? Tough. Want to swap a tool implementation without touching the agent graph? Tough.
- MCP and CLI tools were an afterthought. We use a lot of MCP servers, and managing them and bolting them into every framework felt like a part-time job.
None of this is a knock on those frameworks. They're great for prototyping and small projects. But we wanted something we could hand to a platform team and say "run this in prod." It didn't exist, so we built it.
Meet Orloj
Orloj is an open-source orchestration runtime for multi-agent AI systems. You declare your agents, tools, policies, and workflows in YAML manifests, and Orloj handles scheduling, execution, governance, and reliability. We call it Agent-Infrastructure-as-Code.
If you've used Kubernetes or Terraform, the pattern is going to feel familiar, because that's exactly what inspired us. I've been writing Kubernetes manifests and Terraform modules for years, and the declarative, resource-based model just works for distributed systems. Turns out agent systems are distributed systems with extra steps.
Here's a minimal agent manifest:
```yaml
apiVersion: orloj.dev/v1
kind: Agent
metadata:
  name: telegram-bot-writer-agent
spec:
  model_ref: anthropic-default
  prompt: |
    You are the writer and delivery stage of a Telegram content bot
  tools:
    - telegram-send-message
  limits:
    max_steps: 4
    timeout: 30s
```
And here's what ties a system together:
```yaml
apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: telegram-bot-system
  labels:
    orloj.dev/pattern: pipeline
spec:
  agents:
    - telegram-bot-parser-agent
    - telegram-bot-planner-agent
    - telegram-bot-research-agent
    - telegram-bot-writer-agent
  graph:
    telegram-bot-parser-agent:
      edges:
        - to: telegram-bot-planner-agent
    telegram-bot-planner-agent:
      edges:
        - to: telegram-bot-research-agent
    telegram-bot-research-agent:
      edges:
        - to: telegram-bot-writer-agent
```
That's it. Run `orlojctl apply -f ./my-system` and the scheduler picks it up.
Four things we think we got right
1. Governance is a runtime gate, not a prompt instruction
This was the biggest unlock for us. Instead of trusting the model to follow instructions like "don't call the delete API", we treat policies as first-class resources:
```yaml
apiVersion: orloj.dev/v1
kind: AgentPolicy
metadata:
  name: cost-policy
spec:
  apply_mode: scoped
  target_systems:
    - report-system
  max_tokens_per_run: 50000
  allowed_models:
    - gpt-4o
  blocked_tools:
    - filesystem_delete
```
Every agent turn and every tool call hits the policy engine before it runs. Unauthorized actions fail closed with a structured error and a full audit trail. If the model tries to call a blocked tool, it doesn't get a polite refusal, it gets a hard stop and a trace entry.
Anyone who's ever had to tell a security team "we ask the AI nicely not to exfiltrate data" knows the feeling.
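To make the fail-closed behavior concrete, here's a minimal sketch of what a runtime policy gate boils down to. This is illustrative only; the names `Policy`, `PolicyViolation`, and `check_tool_call` are assumptions, not Orloj's actual API.

```python
# Illustrative sketch of a fail-closed policy gate; names here are
# assumptions for the example, not Orloj's actual internals.
from dataclasses import dataclass, field

@dataclass
class Policy:
    blocked_tools: set = field(default_factory=set)
    max_tokens_per_run: int = 50_000

class PolicyViolation(Exception):
    """Structured error the runtime can record in the audit trail."""
    def __init__(self, code: str, detail: str):
        super().__init__(f"{code}: {detail}")
        self.code = code
        self.detail = detail

def check_tool_call(policy: Policy, tool: str, tokens_used: int) -> None:
    # Every tool call passes through here *before* it executes.
    if tool in policy.blocked_tools:
        raise PolicyViolation("TOOL_BLOCKED", f"tool '{tool}' is blocked by policy")
    if tokens_used > policy.max_tokens_per_run:
        raise PolicyViolation("TOKEN_BUDGET_EXCEEDED",
                              f"{tokens_used} tokens > {policy.max_tokens_per_run} budget")

policy = Policy(blocked_tools={"filesystem_delete"})
try:
    check_tool_call(policy, "filesystem_delete", tokens_used=1200)
except PolicyViolation as err:
    print(err.code)  # → TOOL_BLOCKED
```

The key property is that the check sits in the execution path, not in the prompt: the model never gets the chance to "decide" whether to comply.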
2. Lease-based task ownership, borrowed from distributed systems
Workers claim tasks with a lease. If a worker crashes mid-task, the lease expires and another worker picks it up. No orphans, no double-runs. It's the same pattern Kubernetes uses for controller leader election.
The practical effect: you can run orlojworker on whatever compute you want, including heterogeneous fleets. We run some workers on CPU boxes for cheap tool calls and some on GPU boxes for local model inference. The scheduler doesn't care.
3. Complete audit trails
Orloj records a trace for every run so you can understand and debug how your system behaves at each step: token usage (input/output) per agent, which tools were called and when, and the latency of each step.
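For a sense of what that captures, a single trace entry might look something like this. The field names below are illustrative assumptions, not Orloj's actual trace schema:

```json
{
  "run_id": "run-7f3a",
  "agent": "telegram-bot-writer-agent",
  "step": 3,
  "tool": "telegram-send-message",
  "tokens": { "input": 1840, "output": 212 },
  "latency_ms": 947,
  "policy_decision": "allow"
}
```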
4. MCP servers and CLIs are first-class resources
Register an MCP server (container, stdio, or HTTP) and Orloj auto-discovers its tools. They become resources you can reference in manifests, attach policies to, and audit like anything else. No adapter layer, no wrapper code. Containerized tools also just make sense: why run an MCP server 24/7 when you may only need it a few times a day? Your MCP server spins up when needed and back down when idle.
```yaml
apiVersion: orloj.dev/v1
kind: McpServer
metadata:
  name: gmail
spec:
  transport: stdio
  image: mcp/gmail
  idle_timeout: "5m"
  env:
    - name: GMAIL_OAUTH_PATH
      secretRef: gmail-creds/oauth_keys
      mountPath: /secrets/gcp-oauth.keys.json
    - name: GMAIL_CREDENTIALS_PATH
      secretRef: gmail-creds/credentials
      mountPath: /secrets/credentials.json
```
After that, every tool the MCP server exposes is usable by agents — and governable by policies.
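For example, once the `gmail` server above is registered, an agent manifest could reference one of its discovered tools directly. The tool name below is illustrative; the actual names come from whatever the server exposes:

```yaml
apiVersion: orloj.dev/v1
kind: Agent
metadata:
  name: email-digest-agent
spec:
  model_ref: anthropic-default
  prompt: |
    Summarize today's inbox into a short digest.
  tools:
    - send_email   # discovered from the gmail McpServer; name is illustrative
```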
What we're still figuring out
Orloj is still in beta, so here are the honest tradeoffs today:
- The ecosystem is tiny compared to LangGraph or CrewAI. We have examples, templates, and Python/JS SDKs, but no massive community contributing yet.
- We don't have a hosted platform. You run `orlojd` and `orlojworker` yourself. Some teams love this; some want a managed option, which is coming.
- The declarative model is great for stable workflows and less great for highly dynamic agent graphs that rewrite themselves at runtime. We're working on it.
We'd rather ship these honestly than pretend they're not tradeoffs.
Try it
Everything is Apache 2.0 and on GitHub. The quickstart is a few commands:
Install the CLI:

```shell
brew tap OrlojHQ/orloj
brew install orlojctl
```
Install the server and workers:

```shell
curl -sSfL https://raw.githubusercontent.com/OrlojHQ/orloj/main/scripts/install.sh | ORLOJ_BINARIES="orlojd" sh
curl -sSfL https://raw.githubusercontent.com/OrlojHQ/orloj/main/scripts/install.sh | ORLOJ_BINARIES="orlojworker" sh
```
Once everything is running, open localhost:8080 to see the web UI and manage your systems from there.
- Repo: https://github.com/OrlojHQ/orloj
- Docs: https://docs.orloj.dev
- Website: https://orloj.dev
If you've hit the same walls we did, I'd love to hear what you're building. What would you use this for? What's missing? Drop a comment or open an issue.