I'm writing this down before I lose the thread
I've been sketching an architecture for agent platforms for a while, and the pieces are starting to lock together. Time to get it out of my head.
Two reasons for writing.
First, selfish: good ideas rot in Slack DMs and half-finished notes. Writing forces the system to hold. If I can explain it, it's real.
Second, more honest: a lot of this is about to be obvious. The LLM gateway/proxy space has exploded. Kong, LiteLLM, Portkey, Gravitee, Harness, everyone's building some version of the control plane. My specific combination feels new, but "new" has a shelf life measured in weeks right now. Flag planted.
This is a build log, not a product announcement. I'm running a working version at workspace level with one agent end to end. Policy, router, RAG, tracing, CLI hub, all wired up. What's next is scaling the same pattern up to repo and company level. Some of what I say here is lived experience. Some is where I'm aiming.
Where do agents actually live?
This is the question that wouldn't let me go.
Everyone builds agents. Almost nobody talks about where they live. You watch a demo, someone runs a command, the agent does a thing, the demo ends, the agent evaporates. Cool. Now do that for a hundred engineers across twenty repos for six months.
The moment you try, you hit the questions nobody answers:
- Where does the agent live when it's not actively running?
- How do you do long-horizon work, overnight or across days?
- Can you reconfigure it while it's running, or do you have to kill it?
- When it goes off the rails, who's watching, with what information?
Nobody answers these because the answers commit you to a shape for the agent's world. Most frameworks punt. They give you the brain and hand you the body as an exercise. So I started designing the body.
The shape I landed on: agents live in workspaces. A workspace isn't a folder or a container or a session. It's a persistent, identity-bearing thing. It has a policy. It has inherited credentials. It attaches to repos. It has its own trace. When the agent stops, the workspace stays. When it resumes, everything is still there.
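To make the shape concrete, here's a minimal sketch of what "persistent, identity-bearing thing" means in code. Every name here is invented for illustration; nothing below is a real API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a workspace is a record with identity, not a process.
@dataclass
class Workspace:
    """A persistent, identity-bearing home for an agent."""
    workspace_id: str                 # stable identity, survives agent restarts
    policy: dict                      # resolved policy (cascaded from above)
    credentials: dict                 # inherited tokens, scoped to this workspace
    repos: list = field(default_factory=list)  # attached repositories
    trace_id: str = ""                # every span attributes to workspace_id

    def attach_repo(self, repo: str) -> None:
        if repo not in self.repos:
            self.repos.append(repo)

# The agent process can stop and restart; the workspace record persists.
ws = Workspace(
    workspace_id="ws-backend-refactor-01",
    policy={"model_allowlist": ["sonnet-4.6"]},
    credentials={"jira_token": "company.default"},
)
ws.attach_repo("inventory-service")
```

The point of the sketch is what's absent: there's no reference to a running process anywhere. The workspace is data, which is why it can outlive any one run.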
And a workspace is also a place a human can walk into and take over.
Human-in-the-workshop, not human-in-the-loop
Every enterprise AI deck has a slide that says "human in the loop" with a stick figure approving things. That framing is backwards, and it's why so many rollouts stall.
Human-in-the-loop treats the human as a supervisor. An approval checkpoint. But the real thing is that humans and agents are doing the same job, and they need to do it in the same place, with the same tools, policies, and traces.
If I design the workspace so a human can do everything the agent does (same CLI tools, same services, same tokens, same trace), then:
- Debugging a broken run means stepping in and finishing the task
- Refining the agent means demonstrating correct behavior, which feeds back into config
- Traces don't distinguish human from agent, because the workspace identity is the unit that matters
The workspace stops being a cage and becomes a shared workbench. A human-AI workshop, not a supervision hierarchy.
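One way to see the "same trace" point: if the workspace is the attribution unit, then whether a human or an agent ran the tool is just a field on the span, not a different schema. A hypothetical sketch (the function and field names are mine, not from any real tracing library):

```python
# Hypothetical: workspace identity, not operator, is the unit of attribution.
def emit_span(workspace_id: str, name: str, operator: str = "agent") -> dict:
    return {
        "workspace": workspace_id,  # the identity that matters downstream
        "operator": operator,       # "agent" or "human" -- same schema either way
        "name": name,
    }

# The same call works whether the agent runs the tool or a human steps in:
agent_span = emit_span("ws-backend-refactor-01", "tool.git_commit")
human_span = emit_span("ws-backend-refactor-01", "tool.git_commit", operator="human")
```

Because both spans land in the same stream under the same workspace identity, stepping in to finish a run doesn't fork the audit trail.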
Why no skills
The industry has converged hard on "skills." Every framework ships with them. I'm not building them, and I want to say why, because it's the core of the bet.
A skill, as the industry defines it, is a prompt the agent judges whether to invoke. You write a structured markdown file, the agent scans the available skills, and it decides on its own judgment whether one applies. When it picks right, it feels magical. When it doesn't, you're debugging vibes.
Fine for exploratory work. Not fine for what I'm building.
I'm building a deterministic system that AI drives. The AI will make mistakes. That's a given, not a bug. The job of the platform is to contain those mistakes inside flows that are solid and traceable. The flow doesn't get to happen or not happen based on the agent's mood. The flow is the thing.
This is why the CLI Hub as a golden path matters so much. Every tool usage flows through a deterministic, traced, policy-enforced rail. I'm not hoping the agent picks the right skill. I'm constraining the ground it walks on. In my system the skill isn't a prompt. The skill is the system. The CLI. The policy cascade. The workflow. The router.
The industry's skill is a suggestion the agent considers. Mine is a rail the agent runs on.
This isn't anti-AI. The more the platform handles determinism, the more the AI can be creative inside it. Ground the agent, then let it run. Don't ask it to ground itself.
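Here's what "rail, not suggestion" looks like in miniature. This is a toy sketch of the dispatch idea, not the actual CLI Hub; the tool names and the shape of the policy dict are invented:

```python
# Hypothetical sketch of the "rail": every tool call goes through one
# allowlisted, policy-checked, traced table. The agent never self-selects.
ALLOWED_TOOLS = {"git_commit", "jira_update"}  # the ground the agent walks on

def dispatch(tool: str, policy: dict, trace: list) -> str:
    if tool not in ALLOWED_TOOLS:
        trace.append(f"denied:{tool}")         # denial is traced too
        raise PermissionError(f"{tool} is not on the rail")
    if tool not in policy.get("tool_allowlist", ALLOWED_TOOLS):
        trace.append(f"policy_denied:{tool}")
        raise PermissionError(f"{tool} blocked by policy")
    trace.append(f"ran:{tool}")                # every use is traced, no bypass
    return f"ok:{tool}"

trace: list = []
dispatch("git_commit", {"tool_allowlist": ["git_commit"]}, trace)

# A tool the rail doesn't know never gets a vote, whatever the agent "decides":
try:
    dispatch("rm_rf_slash", {}, trace)
except PermissionError:
    pass
```

The agent's judgment still picks *which* rail to run on; it just never gets to conjure a rail that isn't there.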
Tracing is the floor, not the ceiling
I used to think tracing was an observability concern. Something to add later, like metrics. I was wrong.
Tracing is the foundation because agents generate trust deficits faster than logging can clear them. If a human made a weird commit, you ask them in standup. If an agent made a weird commit at 3am using an inherited Jira token to move a ticket, the only way anyone understands what happened is if the trace is good enough to reconstruct it. The policy that was resolved. The credential that was used. The workflow step. The repo. The model.
Most gateway products trace the LLM call and stop. That's useless. The interesting question isn't "which model was called," it's "which workspace used which inherited credential at which step of which workflow against which repo." If your trace answers that sentence, you have governance. If it doesn't, you have a dashboard.
Here's what a trace looks like in the system I'm running today:
```
trace_id: 7f9c2a...
workspace: ws-backend-refactor-01
agent: claude-opus-4.7
operator: agent                # or "human" when I step in
repo: inventory-service        (policy layer: repo)
policy_source:
  model_allowlist:    company.default
  jira_token:         company.default
  rag_scope:          repo.override      # scoped to this repo
  workflow_max_steps: workspace.override
span: workflow.refactor_auth
  span: step.analyze_imports
    span: router.resolve_policy                         (2ms)
    span: rag.retrieve (docs: shared + repo, hit=4)     (180ms)
    span: llm.call (model: sonnet-4.6, tokens: 2.1k)    (1.2s)
  span: step.apply_changes
    span: router.resolve_policy                         (1ms)
    span: tool.git_commit (via cli-hub)                 (80ms)
    span: service.jira.update (token: company.default)  (220ms)
      # attributable to ws-backend-refactor-01,
      # even though the token is company-level
```
The key detail is the policy_source block. For every value this workspace resolved, I can see which level it came from. Model allowlist: company default. Jira token: company default. RAG scope: overridden at the repo level. That's not a log line. That's a governance story.
When I scale this to company level, that block gets more interesting. Right now every value resolves from company or workspace, because those are the two levels I've wired up. Once repos and workspace overrides cascade properly, this same trace starts showing "three of these values came from company, one was overridden by the repo, two were overridden by the workspace." Audit becomes a read, not forensic archaeology.
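The cascade itself is simple enough to sketch. This is my illustration of the idea, not the running code: resolve each key through company → repo → workspace, and record which level set the winning value. That record is exactly what the policy_source block in the trace is:

```python
# Hypothetical sketch: later levels override earlier ones, and we remember
# which level won for every key. Keys and values are invented examples.
def resolve_policy(company: dict, repo: dict, workspace: dict):
    resolved, source = {}, {}
    for level_name, level in (("company.default", company),
                              ("repo.override", repo),
                              ("workspace.override", workspace)):
        for key, value in level.items():
            resolved[key] = value     # later levels override earlier ones
            source[key] = level_name  # remember who set the final value
    return resolved, source

company = {"model_allowlist": ["sonnet-4.6"],
           "jira_token": "tok-company",
           "rag_scope": "shared"}
repo = {"rag_scope": "shared+repo"}          # repo narrows the RAG scope
workspace = {"workflow_max_steps": 40}       # workspace caps its own runs

resolved, source = resolve_policy(company, repo, workspace)
# source["rag_scope"] ends up "repo.override"; untouched keys stay "company.default".
```

The useful property: attribution falls out of resolution for free. You never have to reconstruct after the fact why a workspace had a given value.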
RAG first, then lower
Most LLM platforms are organized around "which model do we call." Wrong primitive.
The primitive is "what knowledge does this request have access to." The model is a downstream detail.
When a request comes in, the first question isn't GPT or Claude. It's: should this pull context from shared docs? From this repo's docs? From the workspace's scratch space? Should the response be re-grounded against docs on the way out? What's this agent allowed to know, and what are we obligated to cite?
Answer those first, then lower the request into the model. If you build from the LLM up, RAG becomes a clumsy bolt-on. If you build from the knowledge boundaries down, RAG is the layer, and the LLM is just the execution engine.
The router doesn't "route to models." It resolves a policy, applies RAG on input, picks a model, applies RAG on output, and dispatches to external services if needed. The LLM is one of several things that happen inside the resolved policy, not the center of it.
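As a sketch, the pipeline ordering looks something like this. All of it is hypothetical pseudostructure (stub steps, invented policy keys); the point is only the order, with the LLM call sitting in the middle rather than at the front:

```python
# Hypothetical router pipeline: the LLM is one step inside a resolved
# policy, not the entry point. Steps are recorded instead of executed.
def handle_request(request: dict, policy: dict) -> dict:
    steps = ["resolve_policy"]                   # 1. which rules apply
    if policy.get("rag_input", True):
        steps.append("rag.retrieve")             # 2. ground the input first
    model = policy.get("model", "default-model")
    steps.append(f"llm.call:{model}")            # 3. model is a downstream detail
    if policy.get("rag_output", False):
        steps.append("rag.reground")             # 4. re-ground the answer
    for svc in policy.get("services", []):
        steps.append(f"service.{svc}")           # 5. external side effects last
    return {"answer": "...", "steps": steps}

out = handle_request(
    {"prompt": "refactor auth"},
    {"model": "sonnet-4.6", "rag_output": True, "services": ["jira"]},
)
```

Nothing about this ordering is clever; the claim is just that if the knowledge steps come first and last, swapping the model in the middle is a one-line change.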
Route, trace, RAG
Compress it to a triangle:
- Route. Every request goes through one place. Right policy, right credentials, right model, right services. No bypass.
- Trace. Everything emits structured, attributable signal. Workspace identity on every span. Per-workspace visibility into which level each config came from.
- RAG. Knowledge is a first-class layer, not a feature.
Three verbs. Get them right and everything else is tractable. Get any one wrong and the system is insecure, opaque, or hallucinating. Pick your poison.
Most platforms nail one. A few nail two. I haven't seen one that nails all three in a coherent way. That's the gap I'm building into.
The idea I keep not deleting: hivemind standup
The speculative piece I can't shake.
Imagine workspaces don't just sit idle between tasks. They wake up on their own cadence, call it a standup, and talk to each other about open issues. Each workspace has an energy budget proportional to how much meaningful work it did last cycle. High-signal work earns energy. Churn doesn't. They use that energy to flag blockers, propose collaborations, ask for help from workspaces with knowledge they don't have.
The honest framing: this is a social test for the agents. Do they self-organize usefully, or just generate noise and cost? Do productive workspaces emerge as hubs? Does the hivemind surface blockers humans missed, or amplify them?
I don't know. That's why I want to build it. A fleet of agents that can't cooperate without human orchestration is a fleet of expensive Mechanical Turks. A fleet that can is something else.
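For what it's worth, the energy-budget mechanic is small enough to write down, even in this speculative form. Purely a back-of-napkin sketch: the weights and the idea of counting "signal" vs "churn" events are invented for illustration, not something I've built:

```python
# Speculative sketch: a workspace earns budget for high-signal work and
# pays for churn. Weights (earn, cost) are made-up illustration values.
def update_energy(energy: float, signal_events: int, churn_events: int,
                  earn: float = 2.0, cost: float = 0.5) -> float:
    energy += earn * signal_events - cost * churn_events
    return max(energy, 0.0)  # a workspace can't go into debt

# A workspace that shipped 3 real things and churned 4 times:
e = update_energy(10.0, signal_events=3, churn_events=4)  # 10 + 6 - 2 = 14.0
```

The hard part isn't this arithmetic, it's deciding what counts as a signal event, which is exactly the social test.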
Where I am right now: one workspace, full stack, running. Next is getting repo and company policy layers to cascade the way the design says they should. That's the proof point. The traces above are real in shape, simplified in detail. When repo and company are in, the policy_source block starts telling a richer story, and I'll write that up too.
If you're building something that rhymes with this, I want to talk. If you're shipping it already and I should just use yours, I also want to talk.
Writing it down was step one.