Agentic Support Agent for a Platform Teams

#llm #rag #ai #devex

One of the most tiring tasks for R&D teams is supporting their own services. For platform teams it's worse — there's no customer success, no account management, so the entire load falls on engineers who weren't trained for it and didn't sign up for it.

The questions never stop, and a lot of them are the same questions — asked again because the last answer wasn't written down anywhere findable.

After years of being on call, I knew this was the right shape for an agent. Take load off the team, give a thorough answer every time, and don't get short with anyone.

The First Attempt: RAG over the Docs

I started where most people start — by trying to onboard the LLM into the team. Show it what we know, and let it answer from there. I indexed everything I could find: how-to guides, decision docs, service documentation, the whole pile. The agent could now search the corpus and come back with an answer.

The answers weren't good enough.

Two different things were going wrong, and at first they were hard to tell apart.

The first was the documents themselves. Some were years out of date. Some were written by an engineer in a hurry, others by someone with time to be careful, and the agent had no way to tell which was which. Some corners of the platform had thorough write-ups; others had nothing. Whatever the docs were — accurate, stale, partial, or missing — the agent surfaced them with the same confidence.

The second was the RAG mechanics. How you chunk a document changes what gets retrieved — split a paragraph in the wrong place and the relevant sentence ends up separated from the context that makes it meaningful. The embedding model decides what "similar" means, and a mediocre one will happily return a doc that sounds like the question but doesn't actually answer it. And there's the top-k problem: if you return three docs and the right one is the fourth, you don't just get a slightly worse answer — you get a confidently wrong one, because the LLM still has to answer from something.

You can fix the second category with engineering. Better chunking, better embeddings, rerankers, query rewriting — there's a whole industry of techniques for this, and they help. But none of them fix the first category. No retrieval strategy turns a stale doc into a fresh one. And that's the part that started to bother me.

Skipping the RAG Rabbit Hole

So I didn't go down it. I stopped and asked a different question — not about the docs, but about the job: what does an on-call engineer actually do?

They know a fixed set of things — what the services are, how the platform is wired — and when a question needs more, they go look at the live system: read the code, check the logs. That's the whole job. What's painful about it is doing that digging over and over while staying patient at 2am. Notice what's not in there: nobody searches a pile of documents. So RAG was the wrong shape from the start — not because the docs were stale (though they were; they're copies of the code, and copies drift), but because document-retrieval was never part of the job I was trying to copy.

Once I framed it that way, the path forward was obvious. Don't get better at retrieving descriptions of the system. Give the agent access to the system.

The Second Attempt: An Agent with Tools

I dropped the RAG system entirely and went to the source: I gave the agent the ability to read code directly. No chunks, no embeddings, no index to keep fresh — just a tool that clones the repo and searches it at query time. When someone asks how the retry logic handles a specific edge case, the agent goes and reads the retry logic.

But once I'd opened that door, the same logic applied to everything else. The platform's behavior doesn't only live in the code — it lives in the logs when something is failing, in the metrics when something is slow, in the deployment history when something changed last Tuesday. Each of those is its own source of truth for a different kind of question. So I gave the agent tools for those too.

That's the moment it stopped being a RAG system with a different backend and became something else: an agent that doesn't retrieve knowledge about the platform — it queries the platform.

Giving the Agent an Onboarding

Reading code at query time solved one problem and exposed another. To read a service's code, the agent first has to know the service exists and where its code lives — and at first it didn't. Even once you got past that, it could find anything but didn't know anything: every question started from zero. Ask about the database, and it would search the repo to rediscover we're on Postgres. Ask about CI, and it would search again to find CircleCI and ArgoCD. Correct answers, but slow, and oddly disorienting — like talking to a contractor who has to look up where the kitchen is every time.

A human engineer doesn't work this way. By their second week, they know we run on AWS, that databases are RDS Postgres, that CI is CircleCI and deploys go through ArgoCD. They don't look it up — it's just resident, the background the rest of their thinking happens on top of. RAG has no equivalent: it retrieves every fact on demand, so nothing is ever simply known.

So I gave the agent that onboarding. A short doc — one paragraph per service describing what it does and the key technologies behind it, the same things you'd tell a new hire in week one. Not how it works (that's in the code), not what its current state is (that's in the tools), just what it is. "The gateway handles retries and rate limiting" stays true for years even as the retry logic underneath changes every sprint — so the what is worth carrying in the prompt. It's small, it's the same on every request (so it caches well), and it saves a search every time.

The split turned out to be the whole architecture, in retrospect. Slow-changing context — what the system is — lives in the prompt. Fast-changing context — what the code does, what the system is doing right now — lives behind tools. The agent gets the same orientation a new hire would, plus the ability to look at the actual system whenever a question needs it.

So What Are Docs For?

If the code answers the hard questions and the onboarding doc handles orientation, it's fair to ask what documentation is still for. The answer isn't "less" — it's "the part the code can't tell you."

The code is the complete, current answer to what the system does and how. Writing that down in prose just creates a copy that starts rotting immediately. What the code can never tell you is why — why this design over the alternative, what tradeoff you accepted, what you tried first and threw out. Design docs, decision records, the PRD behind a service: those aren't a leftover after the agent eats the rest, they're the whole point. Today the why and the what get tangled in the same doc, and the what rots and drags the why's credibility down with it. Pull them apart and the valuable half stops rotting — a decision doc explaining why you chose a token bucket is still true years after every doc describing how it works went stale.

In Closing

The whole arc came down to one idea, and it wasn't really about agents or retrieval. It was about copying the on-call engineer.

A person on call has two things: a base of knowledge they carry, and the ability to look at the live system when a question needs more. The agent ended up with exactly those — the onboarding doc is what it knows, the tools are how it looks. I didn't choose two tiers because they're elegant; I copied the only two things the human had, and left out the one thing I was trying to remove: the fatigue of doing it for the hundredth time. RAG was the wrong shape because a document search was never one of those two things.

There's a smaller rule that falls out of it, handy on its own: match the freshness of the context to the freshness of the question. What a service is changes slowly, so it can live in the prompt. What the code does changes constantly, so it has to be read live. RAG sits in the middle — a snapshot pretending to be live — which is exactly where it breaks down. But that's downstream of the real lesson: figure out what you're copying, give it what the original actually has, and most of the rest follows.