If you’ve built a demo where an LLM answers questions over your docs, you’ve built RAG.
If you’ve tried to ship it—and suddenly you’re dealing with missing citations, prompt injection, inconsistent tool calls, and “why did it say that?”—you’re building a RAG agent.
This post is a practical blueprint for designing a GenAI RAG agent that is:
- grounded in evidence (with citations),
- capable of multi-step work (tools + loops),
- safe (guardrails + authorization),
- observable (traces + evals),
- and maintainable (clear contracts, not prompt spaghetti).
Everything here is generic and vendor-agnostic. The code snippets are intentionally simplified patterns inspired by production agent wrappers (tool calling, memory summaries, guardrail checks), without any client/project identifiers.
Table of contents
- RAG vs. RAG agents
- A reference architecture you can ship
- Retrieval that actually works
- Context engineering (the underrated part)
- Tool use: the difference between “agent” and “chatbot”
- Memory: short-term chat vs. long-term summaries
- Guardrails: prompt injection, data leaks, and safe tool calls
- Verification: how you earn user trust
- Evaluation + observability
- A shipping checklist
RAG vs. RAG agents
RAG (single-shot) is typically:
- take a question,
- retrieve relevant passages,
- generate an answer.
A RAG agent is a system that can iterate:
- understand the goal,
- decide next steps,
- retrieve evidence (possibly multiple times),
- call tools (search, ticketing, DB lookups, workflows),
- verify results,
- respond with citations.
A simple example:
User: “Summarize the refund policy and open a support ticket if I’m eligible.”
A RAG agent might:
- retrieve the policy pages,
- determine eligibility criteria,
- ask a follow-up (purchase date),
- call a tool to create a ticket,
- and return a final answer with citations + the ticket ID.
This is where architecture matters: once your system can act, you need stronger controls than a prompt.
A reference architecture you can ship
Here’s a diagram-friendly mental model:
User
↓
Orchestrator (routing + policy)
├─ Retriever (vector / keyword / hybrid)
├─ Reranker (optional)
├─ Context Builder (dedupe, trim, cite)
├─ LLM Reasoner (constrained)
├─ Tool Runner (allowlist + authz)
├─ Memory (session + long-term summary)
├─ Guardrails (input/output moderation + injection defenses)
└─ Observability (traces, logs, evals)
↓
Answer + Citations + Actions
The key move is to treat RAG agents as systems:
- Retrieval is a component (not magic).
- Tool execution is a component (not “LLM will behave”).
- Memory is a component (not just “add the chat history”).
- Verification is a component (not “hope the model is careful”).
Putting it together: an end-to-end request handler
Below is a simplified “agent wrapper” flow you can adapt. It mirrors how production systems typically work: apply guardrails, hydrate memory, initialize a tool session, run the agent loop, persist summaries, and return a structured response.
from dataclasses import dataclass
from typing import Any
@dataclass
class AgentRequest:
user_id: str
session_id: str | None
message: str
metadata: dict[str, Any]
@dataclass
class AgentResponse:
session_id: str
answer: str
citations: list[dict[str, Any]]
actions: list[dict[str, Any]]
metadata: dict[str, Any]
def handle_request(req: AgentRequest) -> AgentResponse:
# 1) Establish session
session_id = req.session_id or new_session_id()
# 2) Apply INPUT guardrails (block early if needed)
filtered_message, gr_in = apply_guardrails(guardrails_client(), req.message, source="INPUT")
if gr_in.get("intervened"):
return AgentResponse(
session_id=session_id,
answer="Your request can’t be processed due to safety policies.",
citations=[],
actions=[],
metadata={"guardrails": {"input": gr_in}},
)
# 3) Initialize tool session (for tool servers that require it)
tool_session_id = initialize_tool_session()
# 4) Hydrate long-term memory summary (keep it compact)
summary = load_agent_summary(store(), req.user_id, session_id)
if summary:
filtered_message += "\n\nAgent memory (summary): " + summary
# 5) Retrieve evidence and run the agent loop (tight budgets)
loop_budget = 3
citations: list[dict[str, Any]] = []
actions: list[dict[str, Any]] = []
for _ in range(loop_budget):
query = rewrite_query(filtered_message)
retrieved = retrieve(query, filters=req.metadata)
context = build_context(retrieved)
step = reasoner_llm().next_step(
user_message=filtered_message,
context=context,
allowed_tools=tool_allowlist(),
)
if step.type == "final":
citations = step.citations
answer = step.answer
break
if step.type == "tool_call":
validate_tool_call(step.tool_name, step.arguments, req.user_id)
tool_result = tool_call(step.tool_name, step.arguments, session_id=tool_session_id)
actions.append({"tool": step.tool_name, "result": tool_result})
filtered_message += "\n\nTool result: " + safe_json(tool_result)
# 6) Persist updated memory summary (async is fine)
new_summary = summarize_for_memory(filtered_message, answer)
write_agent_summary(store(), req.user_id, session_id, new_summary, updated_at=iso_now())
# 7) Apply OUTPUT guardrails (don’t leak sensitive data)
answer, gr_out = apply_guardrails(guardrails_client(), answer, source="OUTPUT")
return AgentResponse(
session_id=session_id,
answer=answer,
citations=citations,
actions=actions,
metadata={"guardrails": {"input": gr_in, "output": gr_out}},
)
The big takeaway: agent behavior should be constrained by system code (budgets, allowlists, authz), not by “hoping the prompt is strong enough.”
Retrieval that actually works
Most RAG failures are retrieval failures wearing an LLM costume.
1) Prefer hybrid retrieval
Vector search is great for semantic similarity, but it misses:
- exact identifiers,
- error codes,
- product/version strings,
- proper nouns,
- and “must match” phrases.
A reliable baseline is hybrid retrieval:
- keyword/BM25 for exactness,
- vectors for semantics,
- metadata filters for correctness.
2) Use metadata filters early
Even perfect embeddings won’t save you if you retrieve the wrong edition.
Filter by things like:
- product/version,
- region/locale,
- document type,
- effective date,
- access control labels.
3) Query rewriting is not optional
A user question is not always a good search query.
Example:
- user: “Can I expense travel?”
- better retrieval query: “travel expense policy eligible expenses exceptions receipts approval limit”
In production, you typically want the agent to create a search query (or several) and then retrieve.
4) Rerank if top‑k is noisy
If you retrieve 20 passages and 12 are “kinda related,” you’ll see:
- diluted context,
- token blowups,
- worse answers.
A small reranker step can dramatically improve precision.
Context engineering (the underrated part)
The retrieval step isn’t finished when you get a list of chunks.
Your context builder should:
- deduplicate near-identical chunks,
- keep section titles + timestamps,
- extract only the relevant span (not the entire page),
- preserve stable source IDs for citations,
- and respect a strict token budget.
A practical recipe:
- retrieve
k=20 - rerank to
top=6–8 - extract salient spans (quotes)
- build context with citations
A citation-friendly context format
[Source: doc-17 | “Refund Policy” | Section: Eligibility | Updated: 2025-01-10]
"Refunds are available within 30 days if …"
[Source: doc-23 | “Exceptions” | Section: Digital goods | Updated: 2024-11-02]
"Digital purchases are non-refundable unless …"
This makes it easy to:
- cite sources in the final answer,
- enforce “no citation → no claim,”
- and debug retrieval issues.
Tool use: the difference between “agent” and “chatbot”
Tool use is where a lot of “agents” go sideways in production.
The safe pattern is:
- the model proposes a tool call,
- your system validates it (allowlist + schema + authz),
- your system executes it,
- the model receives the result,
- the agent decides next steps.
A generic tool-call client (JSON‑RPC style)
This snippet shows a minimal pattern for a tool server with session headers and timeouts.
import os
import uuid
import httpx
TOOL_SERVER_URL = os.environ["TOOL_SERVER_URL"]
def call_tool_server(method: str, params: dict | None = None, session_id: str | None = None) -> tuple[dict, dict]:
headers = {
"Content-Type": "application/json",
"Tool-Protocol-Version": "2024-01-01",
}
if session_id:
headers["Tool-Session-Id"] = session_id
body = {
"jsonrpc": "2.0",
"id": str(uuid.uuid4()),
"method": method,
"params": params or {},
}
resp = httpx.post(TOOL_SERVER_URL, json=body, headers=headers, timeout=30)
resp.raise_for_status()
return resp.json(), dict(resp.headers)
def initialize_tool_session() -> str | None:
_, headers = call_tool_server("initialize")
return headers.get("Tool-Session-Id")
def tool_call(name: str, arguments: dict, session_id: str) -> dict:
result, _ = call_tool_server(
"tools/call",
params={"name": name, "arguments": arguments},
session_id=session_id,
)
return result
This is not “agent logic”—it’s infrastructure. Keep it boring.
Tool allowlists and schemas
Before executing a tool call, validate:
- tool name is in an allowlist,
- arguments conform to a schema,
- the user is authorized for the action,
- budgets (max calls / max latency) aren’t exceeded.
That validation should happen outside the model.
Memory: short-term chat vs. long-term summaries
A common mistake is to keep appending the full conversation forever.
That creates:
- token bloat,
- privacy risk,
- and “the model latched onto something from 40 turns ago.”
A more robust approach:
- short-term memory: last N turns (recent, high-fidelity)
- long-term memory: a periodically updated summary (compact, durable)
A generic long-term summary write/read pattern
The snippet below demonstrates a safe pattern:
- store a summary keyed by
user_id+session_id, - update it after each response,
- read it at session start to prime the agent.
from dataclasses import asdict, dataclass
from typing import Any
@dataclass
class AgentSummaryRecord:
user_id: str
session_id: str
updated_at: str
summary: str
class KeyValueStore:
def put(self, key: dict[str, str], item: dict[str, Any]) -> None: ...
def get(self, key: dict[str, str]) -> dict[str, Any] | None: ...
def write_agent_summary(store: KeyValueStore, user_id: str, session_id: str, summary: str, updated_at: str) -> None:
record = AgentSummaryRecord(
user_id=user_id,
session_id=session_id,
updated_at=updated_at,
summary=summary,
)
store.put({"user_id": user_id, "session_id": session_id}, asdict(record))
def load_agent_summary(store: KeyValueStore, user_id: str, session_id: str) -> str | None:
item = store.get({"user_id": user_id, "session_id": session_id})
if not item:
return None
return str(item.get("summary") or "")
What belongs in the summary?
A good long-term summary is not a transcript. It’s:
- user preferences (explicit),
- stable facts the user confirmed,
- open tasks,
- and important constraints.
Avoid storing:
- secrets,
- raw documents,
- PII that doesn’t need to persist.
Guardrails: prompt injection, data leaks, and safe tool calls
If your agent reads documents from the outside world (PDFs, web pages, tickets), assume those documents can contain hostile instructions.
Treat retrieved content as untrusted input
A simple, effective policy:
- retrieved text may contain facts,
- but it may not issue instructions,
- and it may not override system rules.
Apply input/output guardrails as a service
Many orgs implement “guardrails” as a separate layer that:
- screens user inputs,
- screens model outputs,
- optionally redacts/blocks content,
- returns structured metadata (“intervened”, category, severity).
Here is a generic wrapper pattern:
import json
from typing import Any
class GuardrailsClient:
def apply(self, *, content: str, source: str) -> dict[str, Any]:
"""source is typically 'INPUT' or 'OUTPUT'."""
raise NotImplementedError
def apply_guardrails(guardrails: GuardrailsClient, payload: str | dict[str, Any], source: str) -> tuple[str | dict[str, Any], dict[str, Any]]:
is_structured = isinstance(payload, dict)
text = json.dumps(payload) if is_structured else payload
resp = guardrails.apply(content=text, source=source)
# Generic interpretation of a guardrails response
action = str(resp.get("action", "NONE")).upper()
filtered = resp.get("filtered_content", text)
intervened = action in {"BLOCK", "INTERVENED"}
resp["intervened"] = intervened
if is_structured:
try:
return json.loads(filtered), resp
except Exception:
return {"raw_output": filtered}, resp
return filtered, resp
Two practical tips:
- If guardrails intervene, return a safe, deterministic response (don’t ask the LLM to “explain the policy violation”).
- Run guardrails on tool outputs too if they can contain sensitive data.
Tool safety is guardrails + authorization
Guardrails can help with content risk, but tool safety requires:
- server-side authorization,
- immutable audit logs,
- strict budgets.
Never rely on the model to “do the right thing.”
Verification: how you earn user trust
RAG agents gain adoption when users can verify.
Enforce “no citation → no claim”
A strong system rule:
- If the agent can’t cite a source for a statement, it must label it as uncertainty or ask a follow-up.
Quote-first answering
A practical approach:
- extract supporting quotes from retrieved sources,
- write the answer in your own words,
- attach citations.
This reduces hallucinations because the model is anchored to evidence.
Structured outputs for actions
When tools are involved, do not bury results inside prose.
Use an explicit response contract:
{
"answer": "...",
"citations": [
{"source_id": "doc-17", "title": "Refund Policy", "section": "Eligibility", "quote": "..."}
],
"actions": [
{"type": "create_ticket", "status": "success", "ticket_id": "INC-456"}
],
"confidence": "medium",
"follow_ups": ["What was the purchase date?"]
}
That contract makes downstream UX and testing much easier.
Evaluation + observability
If you can’t measure it, you’ll end up debating prompts.
What to log (minimum viable traces)
For each request, capture:
- rewritten search query (or queries),
- retrieval results (source IDs + scores),
- reranking results,
- final context length,
- tool calls (name + args hash + latency + status),
- guardrails action metadata,
- citations returned.
This is how you answer: “Why did it say that?”
What to measure (starter metrics)
- Citation coverage: % of answers with ≥1 citation
- Groundedness: evaluator score or “supported claims ratio”
- Retrieval precision: are top citations actually relevant?
- Escalation rate: how often the agent says “I don’t know” or hands off
- Tool failure rate: how often tool calls fail/time out
- Latency: p50/p95 end-to-end and retrieval/tool breakdown
Offline evaluation set
Build a small eval dataset (even 50–200 questions) with:
- expected source documents,
- disallowed sources,
- expected follow-up questions,
- red-team prompts for injection.
Iterate retrieval first, then prompting.
A shipping checklist
If you want a pragmatic sequence that reduces risk:
- Ship RAG with citations (even if answers are short)
- Add hybrid retrieval + metadata filtering
- Add reranking if top‑k is noisy
- Add a context builder (dedupe + span extraction)
- Add guardrails (input + output)
- Add tool runner (allowlist + schema + authz)
- Add a tight agent loop (max 2–3 iterations)
- Add verification (no citation → no claim)
- Add tracing + offline evals
This order helps you avoid “agent chaos” before your foundations are stable.
Closing thoughts
A RAG agent is best thought of as a retrieval system with an LLM interface—not the other way around.
If you invest in retrieval quality, context building, tool safety, and verification, you get a system users trust.
If you skip those and jump straight to “agent prompts,” you get a system that demos well and pages you at 2am.
About the Author:
Written by Suraj Khaitan
Top comments (0)