Customer support is one of the few places where RAG and agents earn their keep immediately: the questions are real, the knowledge changes constantly, and a wrong answer has a cost. I built an open-source agentic RAG platform for support automation, and the design choice I keep coming back to is that almost everything should be configuration, not code.
Repo: https://github.com/ahmet-ozel/agentic-rag-customer-support
Why config-driven
A support assistant is never "done." You add a new product, a new escalation rule, a new data source, a new tone of voice. If each of those changes means editing Python and redeploying, the system rots. So the agent behavior, the tools it can call, the data sources, and the routing rules all live in configuration. Adding a knowledge source or a new tool is an edit to config, not a code change.
This also makes the system easier to reason about. You can read one config file and know what the agent is allowed to do, where it gets its knowledge, and how it decides what to answer.
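To make "configuration, not code" concrete, here is a minimal sketch of what such a file and its loader might look like. The schema (the `tools`, `sources`, and `routes` keys) and the `load_config` helper are assumptions for illustration, not the repo's actual config format.

```python
import json
from dataclasses import dataclass

# Hypothetical config; the real project's schema may differ.
CONFIG_TEXT = """
{
  "tools": ["postgres", "qdrant", "docling", "paddleocr"],
  "sources": [{"name": "product_docs", "collection": "docs_v1"}],
  "routes": {"faq": ["qdrant"], "billing": ["postgres", "qdrant"]}
}
"""


@dataclass(frozen=True)
class AgentConfig:
    tools: tuple    # tool names the agent may call
    sources: tuple  # names of the knowledge sources
    routes: dict    # intent -> list of tools allowed for that intent


def load_config(text: str) -> AgentConfig:
    raw = json.loads(text)
    tools = tuple(raw["tools"])
    # Fail fast if a route references a tool that was never declared.
    for intent, allowed in raw["routes"].items():
        unknown = [t for t in allowed if t not in tools]
        if unknown:
            raise ValueError(f"route {intent!r} uses undeclared tools: {unknown}")
    sources = tuple(s["name"] for s in raw["sources"])
    return AgentConfig(tools=tools, sources=sources, routes=raw["routes"])


config = load_config(CONFIG_TEXT)
print(config.routes["billing"])
```

Under this sketch, adding a tool or a source is a one-line change to the JSON, and the loader rejects a route that points at something undeclared.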
The pieces
The platform wires together a few components behind a FastAPI server:
- An LLM as the reasoning core
- MCP servers as the tool layer (postgres, qdrant, docling, paddleocr), so the agent can query a database, search a vector store, parse documents, and run OCR through a uniform tool interface
- A vector database (Qdrant) for retrieval
- A document pipeline that ingests and processes the knowledge base
- An intent router that decides what kind of request came in
- An agent loop that plans, calls tools, checks results, and answers
The intent router matters more than the model
The instinct is to send everything to one big agent and let it figure things out. In practice, a lightweight intent router in front of the agent does a lot of work: a simple FAQ lookup does not need a multi-step agent, and a billing question needs different tools than a how-to question. Routing first keeps cost down and latency predictable, and only sends the genuinely hard requests into the full agent loop.
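A router like this does not have to be sophisticated to be useful. The sketch below uses plain keyword rules; the `Route` type, the patterns, and the tool assignments are hypothetical, and a real router could just as well be a small classifier or a cheap LLM call.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    intent: str
    use_agent: bool  # False means a single retrieval is enough
    tools: tuple     # tools made available for this intent


# Hypothetical rules, checked in order; the first match wins.
RULES = [
    (re.compile(r"\b(invoice|refund|charge|billing)\b", re.I),
     Route("billing", True, ("postgres", "qdrant"))),
    (re.compile(r"\b(how do i|how to|what is|where can i)\b", re.I),
     Route("faq", False, ("qdrant",))),
]

# Anything the rules do not recognize goes to the full agent loop.
FALLBACK = Route("general", True, ("postgres", "qdrant", "docling", "paddleocr"))


def route(message: str) -> Route:
    for pattern, matched in RULES:
        if pattern.search(message):
            return matched
    return FALLBACK


print(route("How do I reset my password?"))
```

Only requests that fall through to `FALLBACK`, or that match an intent with `use_agent=True`, pay for the multi-step loop.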
The agent loop
For the requests that do need it, the agent runs an iterative tool-calling loop: read the request, decide which tool to use (retrieve from the vector store, query postgres, parse a document), evaluate whether the result is sufficient, and either answer or take another step. MCP is what keeps this clean. The agent reasons about which tool to call; it does not need to know how each backend works.
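The loop itself can be sketched in a few lines. Everything here is a stand-in: the tools are plain functions rather than MCP calls, and `choose_tool` and `is_sufficient` replace what would be LLM decisions in the real system.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]


def search_vectors(query: str) -> str:
    # Stand-in for a vector-store lookup.
    return "Reset your password from Settings > Security."


def query_orders(query: str) -> str:
    # Stand-in for a database query.
    return "order 1001: shipped"


TOOLS: Dict[str, Tool] = {"qdrant": search_vectors, "postgres": query_orders}


def choose_tool(request: str, history: List[Tuple[str, str]]) -> Optional[str]:
    """Stand-in for the model's choice: preferred tool first, then any unused one."""
    used = {name for name, _ in history}
    preferred = "postgres" if "order" in request.lower() else "qdrant"
    for name in [preferred, *TOOLS]:
        if name not in used:
            return name
    return None  # every tool has been tried


def is_sufficient(result: str) -> bool:
    # Stand-in check; a real system would judge relevance, not emptiness.
    return bool(result.strip())


def run_agent(request: str, max_steps: int = 4) -> str:
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        name = choose_tool(request, history)
        if name is None:
            break
        result = TOOLS[name](request)
        history.append((name, result))
        if is_sufficient(result):
            return result
    return "I could not find an answer; handing off to a human."


print(run_agent("Where is my order 1001?"))
```

The loop only ever sees tool names and string results, which is the property the uniform tool interface is meant to provide. The `max_steps` cap keeps a confused agent from looping indefinitely.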
What I would do differently
The biggest lesson was to invest in evaluation early. It is easy to demo a support agent that answers three questions well. It is hard to know whether a config change made it better or worse across a hundred real questions. If I started over, I would build the eval harness before the second feature.
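Even a very small harness answers the "better or worse" question. The sketch below scores two answer functions against the same labeled cases; the cases, the substring check, and both answer functions are hypothetical placeholders for real questions and real agent configurations.

```python
from typing import Callable, List, Tuple

# (question, substring a correct answer must contain); illustrative cases only.
Case = Tuple[str, str]

CASES: List[Case] = [
    ("How do I reset my password?", "settings"),
    ("Where is my order?", "tracking"),
    ("Can I get a refund?", "30 days"),
]


def evaluate(answer_fn: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose answer contains the expected text."""
    passed = sum(1 for question, expected in cases
                 if expected.lower() in answer_fn(question).lower())
    return passed / len(cases)


def baseline(question: str) -> str:
    # Stand-in for the agent under the old config: one canned answer.
    return "Open Settings > Security to reset your password."


def candidate(question: str) -> str:
    # Stand-in for the agent under the new config.
    q = question.lower()
    if "order" in q:
        return "Use the tracking link in your confirmation email."
    if "refund" in q:
        return "Refunds are available within 30 days."
    return baseline(question)


print(f"baseline:  {evaluate(baseline, CASES):.2f}")
print(f"candidate: {evaluate(candidate, CASES):.2f}")
```

Substring matching is crude, but it is enough to catch regressions; the scoring function can be swapped for something stricter once the case set grows.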
Repo and setup: https://github.com/ahmet-ozel/agentic-rag-customer-support
If you have built support automation with RAG, I would like to hear how you handle routing and escalation to a human. Where do you draw the line on letting the agent answer versus handing off?
Top comments (5)
That makes sense. Historical escalation logs are probably the cleanest bootstrap signal because they already encode where confidence was not enough in the real workflow. I would still keep a small reviewed set per intent, especially for billing/account-change paths, but using logs first should make the cold start much less painful. In AgentDesk I would probably turn that into an eval set and revisit the thresholds after each batch of real conversations.
On the handoff question: in our LangGraph support graph we made human-escalation an edge condition, not a tool the agent can choose. Each node emits a confidence score and the conditional edge fires when it dips below a per-intent floor (billing strict, FAQ lenient). Making it a hard graph edge instead of a tool was what finally stopped the agent from confidently answering past its competence.
Thanks, this is a solid pattern. I like making escalation a graph edge instead of a tool the agent can choose. In production flows, handoff belongs to the control layer, not the action set. Per-intent thresholds also make sense; billing and account changes should be stricter than FAQ. I may try this in my AgentDesk flow.
Glad it resonates. One thing I found in production: the threshold calibration phase is the most painful part. You end up manually labeling a few hundred intent samples before the floors feel right. For AgentDesk, if you already have historical escalation logs, those are gold for bootstrapping the per-intent floors; they would let you skip the cold start entirely. Let me know how it goes.
Agreed. Historical escalation logs are probably the best shortcut if the label quality is decent. I would start with severity-based priors, backfill a small labeled set, then tighten the thresholds after a few live review cycles. That feels safer than overfitting the first batch.