Self-Hosting AI Agents vs Managed: Honest Trade-offs
A few months in, I keep coming back to the same conversation with people building on agents: should you self-host, or just pay someone to run them for you? It sounds like a procurement question. In practice it's a question about how much weirdness you're willing to live with, and how much of the weirdness you want to be yours.
I run a small AI agent service called RapidClaw. My brother Brandon is the tech lead and we cap the number of concurrent agents we run at five — not as a marketing line, as an honest constraint. Five is the number where I can still look at every trace, name every memory key, and tell you what each agent did yesterday. Past that, I start lying to myself about what I actually understand. So I'd rather be small and clear than big and fuzzy.
That bias colors everything below. I'm not trying to talk anyone out of using a managed platform. I'm trying to write down the trade-offs the way I actually experienced them, in case it saves someone a weekend.
The honest pitch for managed
If you've never run an agent in production, start managed. I mean it. The boring stuff — retries, queueing, an eval harness, secret rotation, log shipping, a UI someone other than you can use — is six to eight weeks of work that doesn't move your product forward. You're paying a managed provider to skip that, and skipping it is correct when the agent isn't yet the thing your customers love.
The catch is that "managed" is doing a lot of work in that sentence. There's managed-as-in-hosted (your prompts, their runtime), and there's managed-as-in-opinionated (their prompts, their runtime, their memory model). The second kind feels great in week one and starts to chafe in week six, when you realize you can't see why an agent decided what it decided, and your only recourse is a support ticket.
The questions I'd push on before signing anything:
- Can I export every trace, every tool call, every memory write — as JSON, on demand, without a CSV button hidden three menus deep?
- When a run fails, do I get the actual model response, or a sanitized "something went wrong"?
- If I want to swap the underlying model next quarter, is that a config change or a rewrite?
- What does the bill look like at 10x my current usage? At 100x? Is it linear, or is there a cliff at the "enterprise" tier?
If the answers are clean, managed is a fine home for a long time.
Why I ended up self-hosting anyway
For RapidClaw the deciding factor wasn't cost. It was the loop time on debugging. We were chasing a memory bug where an agent kept hallucinating a customer's preferred timezone. On a managed runtime I could see the final output and the tool calls, but not the actual sequence of memory reads the agent did before responding. Two days of poking later, Brandon stood up a small local runtime and we found it in twenty minutes — the agent was reading a stale snapshot because the memory write from the prior turn hadn't been flushed before the next read.
That's the kind of bug you can only catch when you can stop the world and look at it. Managed platforms are getting better at this, but "better" is not the same as "I can drop a print statement wherever I want."
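The failure mode we hit is easy to reproduce in miniature: a read that races ahead of an unflushed write. Here's a hedged sketch, with every name made up for illustration, none of it our actual runtime:

```python
# Minimal reproduction of the stale-snapshot bug: the agent reads
# memory before the previous turn's write has been flushed.
# All names here are illustrative, not RapidClaw internals.

class MemoryStore:
    def __init__(self):
        self.committed = {}   # what reads actually see
        self.pending = {}     # writes buffered but not yet flushed

    def write(self, key, value):
        self.pending[key] = value  # buffered, invisible to reads

    def flush(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def read(self, key):
        return self.committed.get(key)  # ignores pending writes

store = MemoryStore()
store.write("customer:tz", "Europe/Berlin")  # turn 1 learns the timezone

# Bug: turn 2 reads before the flush, sees nothing, and the model
# falls back to guessing (i.e. hallucinating) a timezone.
assert store.read("customer:tz") is None

store.flush()  # the fix: flush writes before the next turn's reads
assert store.read("customer:tz") == "Europe/Berlin"
```

On a managed runtime, the gap between `write` and `flush` was invisible; locally, one breakpoint in `read` made it obvious.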
The other thing that pushed us was the customer mix. Most of our customers want their agents running in their own VPC, on their own keys, talking to their own internal data. "Send your data to our SaaS" is a non-starter for them. So a self-hosted setup wasn't a nice-to-have — it was the product.
What self-hosting actually costs (the parts nobody warns you about)
Compute is the easy line item. Even the embarrassingly inefficient version of running five agents on a single mid-tier box costs less per month than one good lunch. That's not where the bill is.
Where the bill is:
Observability. You will reinvent some version of structured tracing for agent steps. Tool call in, model response out, memory delta, retry attempts, token counts. You can lean on OpenTelemetry, but the agent-shaped semantics are still yours to define. Budget two weeks the first time and another week every quarter to keep it honest.
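The agent-shaped semantics can be as small as one record per step that serializes to JSON. A sketch of what ours converged toward; the field names are our guesses at the minimum, not a standard:

```python
# One possible shape for an agent step trace: a plain dataclass that
# serializes to JSON. Field names are illustrative, not a standard.
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class StepTrace:
    run_id: str
    step: int
    tool_call: dict          # tool name + arguments going in
    model_response: str      # raw model output, unsanitized
    memory_delta: dict       # keys written during this step
    retries: int = 0
    tokens_in: int = 0
    tokens_out: int = 0
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

trace = StepTrace(
    run_id="run-42", step=1,
    tool_call={"name": "lookup_timezone", "args": {"customer": "acme"}},
    model_response="Customer timezone is Europe/Berlin.",
    memory_delta={"customer:tz": "Europe/Berlin"},
    tokens_in=812, tokens_out=34,
)
print(trace.to_json())  # ship this to whatever log sink you already have
```

Whether the sink is OpenTelemetry, a file, or a table, the hard part is agreeing on these fields before the first outage, not after.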
Eval harness. Without a managed eval surface, you need to build the small, ugly version yourself. A folder of scenarios, a runner that hits each one, a diff viewer for outputs. It can be a hundred lines of Python. It cannot be zero lines of Python.
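For concreteness, here's that harness compressed to its skeleton. It assumes a `scenarios/` folder of JSON files shaped like `{"input": ..., "expected": ...}` and an agent exposed as a plain function; both are stand-ins for whatever you actually have:

```python
# The "hundred lines of Python" eval harness, reduced to its skeleton.
# scenario files and run_agent are stand-ins, not a real API.
import difflib
import json
from pathlib import Path

def run_agent(prompt: str) -> str:
    # Stand-in for your real agent call.
    return prompt.upper()

def run_evals(scenario_dir: str) -> list[tuple[str, bool, str]]:
    results = []
    for path in sorted(Path(scenario_dir).glob("*.json")):
        case = json.loads(path.read_text())
        got = run_agent(case["input"])
        ok = got == case["expected"]
        # Unified diff so failures are readable, not just red.
        diff = "" if ok else "\n".join(difflib.unified_diff(
            case["expected"].splitlines(), got.splitlines(),
            fromfile="expected", tofile="got", lineterm="",
        ))
        results.append((path.name, ok, diff))
    return results
```

Ugly, yes. But it runs in CI, it runs before every prompt change, and it exists.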
On-call. The first time an agent loops forever at 3am, you find out whether you have an on-call rotation. We didn't. Now we do. It's two people taking turns and a PagerDuty free tier, but it exists.
Memory state, specifically. This is the one I underestimated most. Agents that hold any state across turns — which is most useful agents — turn small bugs in your memory layer into very weird behavior in the model layer. I now spend more time thinking about how memory is read, written, snapshotted, and pruned than I spend thinking about prompts. If I were starting again, I'd build the memory inspector before I built the second agent.
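The memory inspector doesn't need to be fancy. Something as small as the sketch below — a table of keys, values, and which step last wrote them — turns "why does the agent think X" into a table scan instead of archaeology. The schema is illustrative, not what we run:

```python
# A minimal memory inspector: every key records which step wrote it
# and when. Schema is an example, not a recommendation.
import sqlite3

def init_memory(db: sqlite3.Connection) -> None:
    db.execute("""CREATE TABLE IF NOT EXISTS memory (
        key TEXT PRIMARY KEY,
        value TEXT,
        written_by_step INTEGER,
        written_at TEXT DEFAULT CURRENT_TIMESTAMP
    )""")

def write_memory(db, key, value, step):
    db.execute(
        "INSERT INTO memory (key, value, written_by_step) VALUES (?, ?, ?) "
        "ON CONFLICT(key) DO UPDATE SET value=excluded.value, "
        "written_by_step=excluded.written_by_step, "
        "written_at=CURRENT_TIMESTAMP",
        (key, value, step),
    )

def inspect(db):
    # One query answers "what does the agent believe, and since when?"
    return db.execute(
        "SELECT key, value, written_by_step FROM memory ORDER BY key"
    ).fetchall()

db = sqlite3.connect(":memory:")
init_memory(db)
write_memory(db, "customer:tz", "Europe/Berlin", step=1)
write_memory(db, "customer:tz", "UTC", step=7)  # later step overwrites
print(inspect(db))  # [('customer:tz', 'UTC', 7)]
```

The point isn't SQLite; it's that every memory write leaves a provenance trail you can query when the model starts acting strangely.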
The middle path most people end up at
Almost no team I've talked to runs a pure managed or pure self-hosted setup for long. The shape that keeps emerging is: managed for the orchestration and the model gateway, self-hosted for the memory and the tool layer. You give up a little observability on the orchestration side, you keep all the observability on the parts where bugs actually live.
That hybrid is what we ended up shipping for our own customers. The agent dashboard runs as a managed service so people don't have to host a UI, but the agents themselves run wherever the customer wants — their cluster, their keys, their VPC. It's not the cleanest architecture story, and it took us longer than I'd like to admit to stop apologizing for it.
What I'd tell past-me
Three things, none of them clever:
- Pick the option that makes your debugging loop shorter, not the one that makes your slide deck better. If you can't see why an agent did what it did, you don't have an agent — you have a wishing well.
- Cap the number of concurrent agents at a number you can mentally model. For us that's five. For a bigger team it might be twenty. It is almost certainly not "as many as the platform supports."
- Write the boring runbooks early. The retry policy, the memory snapshot policy, the rollback procedure. They feel like overkill until the first real outage, after which they feel like the only adult thing in the room.
If any of this is useful, or if you want to compare notes on what you've broken, I'd love to hear about it — the RapidClaw team is small enough that you'll get an actual human, probably me or Brandon.
We're still figuring this out. I just wanted to write down what we've found so far, while it still feels true.