Suraj Khaitan

Posted on Dec 28, 2025

Retrieval-Augmented Generation (RAG) Agents: How to Build Grounded, Tool‑Using GenAI Systems

#ai #agents #rag #azure

If you’ve built a demo where an LLM answers questions over your docs, you’ve built RAG.

If you’ve tried to ship it—and suddenly you’re dealing with missing citations, prompt injection, inconsistent tool calls, and “why did it say that?”—you’re building a RAG agent.

This post is a practical blueprint for designing a GenAI RAG agent that is:

grounded in evidence (with citations),
capable of multi-step work (tools + loops),
safe (guardrails + authorization),
observable (traces + evals),
and maintainable (clear contracts, not prompt spaghetti).

Everything here is generic and vendor-agnostic. The code snippets are intentionally simplified patterns inspired by production agent wrappers (tool calling, memory summaries, guardrail checks), without any client/project identifiers.

RAG vs. RAG agents
A reference architecture you can ship
Retrieval that actually works
Context engineering (the underrated part)
Tool use: the difference between “agent” and “chatbot”
Memory: short-term chat vs. long-term summaries
Guardrails: prompt injection, data leaks, and safe tool calls
Verification: how you earn user trust
Evaluation + observability
A shipping checklist

RAG vs. RAG agents

RAG (single-shot) is typically:

take a question,
retrieve relevant passages,
generate an answer.

A RAG agent is a system that can iterate:

understand the goal,
decide next steps,
retrieve evidence (possibly multiple times),
call tools (search, ticketing, DB lookups, workflows),
verify results,
respond with citations.

A simple example:

User: “Summarize the refund policy and open a support ticket if I’m eligible.”

A RAG agent might:

retrieve the policy pages,
determine eligibility criteria,
ask a follow-up (purchase date),
call a tool to create a ticket,
and return a final answer with citations + the ticket ID.

This is where architecture matters: once your system can act, you need stronger controls than a prompt.

A reference architecture you can ship

Here’s a diagram-friendly mental model:

User
  ↓
Orchestrator (routing + policy)
  ├─ Retriever (vector / keyword / hybrid)
  ├─ Reranker (optional)
  ├─ Context Builder (dedupe, trim, cite)
  ├─ LLM Reasoner (constrained)
  ├─ Tool Runner (allowlist + authz)
  ├─ Memory (session + long-term summary)
  ├─ Guardrails (input/output moderation + injection defenses)
  └─ Observability (traces, logs, evals)
  ↓
Answer + Citations + Actions

The key move is to treat RAG agents as systems:

Retrieval is a component (not magic).
Tool execution is a component (not “LLM will behave”).
Memory is a component (not just “add the chat history”).
Verification is a component (not “hope the model is careful”).

Putting it together: an end-to-end request handler

Below is a simplified “agent wrapper” flow you can adapt. It mirrors how production systems typically work: apply guardrails, hydrate memory, initialize a tool session, run the agent loop, persist summaries, and return a structured response.

from dataclasses import dataclass
from typing import Any


@dataclass
class AgentRequest:
    user_id: str
    session_id: str | None
    message: str
    metadata: dict[str, Any]


@dataclass
class AgentResponse:
    session_id: str
    answer: str
    citations: list[dict[str, Any]]
    actions: list[dict[str, Any]]
    metadata: dict[str, Any]


def handle_request(req: AgentRequest) -> AgentResponse:
    # 1) Establish session
    session_id = req.session_id or new_session_id()

    # 2) Apply INPUT guardrails (block early if needed)
    filtered_message, gr_in = apply_guardrails(guardrails_client(), req.message, source="INPUT")
    if gr_in.get("intervened"):
        return AgentResponse(
            session_id=session_id,
            answer="Your request can’t be processed due to safety policies.",
            citations=[],
            actions=[],
            metadata={"guardrails": {"input": gr_in}},
        )

    # 3) Initialize tool session (for tool servers that require it)
    tool_session_id = initialize_tool_session()

    # 4) Hydrate long-term memory summary (keep it compact)
    summary = load_agent_summary(store(), req.user_id, session_id)
    if summary:
        filtered_message += "\n\nAgent memory (summary): " + summary

    # 5) Retrieve evidence and run the agent loop (tight budgets)
    loop_budget = 3
    citations: list[dict[str, Any]] = []
    actions: list[dict[str, Any]] = []

    for _ in range(loop_budget):
        query = rewrite_query(filtered_message)
        retrieved = retrieve(query, filters=req.metadata)
        context = build_context(retrieved)

        step = reasoner_llm().next_step(
            user_message=filtered_message,
            context=context,
            allowed_tools=tool_allowlist(),
        )

        if step.type == "final":
            citations = step.citations
            answer = step.answer
            break

        if step.type == "tool_call":
            validate_tool_call(step.tool_name, step.arguments, req.user_id)
            tool_result = tool_call(step.tool_name, step.arguments, session_id=tool_session_id)
            actions.append({"tool": step.tool_name, "result": tool_result})
            filtered_message += "\n\nTool result: " + safe_json(tool_result)

    # 6) Persist updated memory summary (async is fine)
    new_summary = summarize_for_memory(filtered_message, answer)
    write_agent_summary(store(), req.user_id, session_id, new_summary, updated_at=iso_now())

    # 7) Apply OUTPUT guardrails (don’t leak sensitive data)
    answer, gr_out = apply_guardrails(guardrails_client(), answer, source="OUTPUT")

    return AgentResponse(
        session_id=session_id,
        answer=answer,
        citations=citations,
        actions=actions,
        metadata={"guardrails": {"input": gr_in, "output": gr_out}},
    )

The big takeaway: agent behavior should be constrained by system code (budgets, allowlists, authz), not by “hoping the prompt is strong enough.”

Retrieval that actually works

Most RAG failures are retrieval failures wearing an LLM costume.

1) Prefer hybrid retrieval

Vector search is great for semantic similarity, but it misses:

exact identifiers,
error codes,
product/version strings,
proper nouns,
and “must match” phrases.

A reliable baseline is hybrid retrieval:

keyword/BM25 for exactness,
vectors for semantics,
metadata filters for correctness.

2) Use metadata filters early

Even perfect embeddings won’t save you if you retrieve the wrong edition.

Filter by things like:

product/version,
region/locale,
document type,
effective date,
access control labels.

3) Query rewriting is not optional

A user question is not always a good search query.

Example:

user: “Can I expense travel?”
better retrieval query: “travel expense policy eligible expenses exceptions receipts approval limit”

In production, you typically want the agent to create a search query (or several) and then retrieve.

4) Rerank if top‑k is noisy

If you retrieve 20 passages and 12 are “kinda related,” you’ll see:

diluted context,
token blowups,
worse answers.

A small reranker step can dramatically improve precision.

Context engineering (the underrated part)

The retrieval step isn’t finished when you get a list of chunks.

Your context builder should:

deduplicate near-identical chunks,
keep section titles + timestamps,
extract only the relevant span (not the entire page),
preserve stable source IDs for citations,
and respect a strict token budget.

A practical recipe:

retrieve k=20
rerank to top=6–8
extract salient spans (quotes)
build context with citations

A citation-friendly context format

[Source: doc-17 | “Refund Policy” | Section: Eligibility | Updated: 2025-01-10]
"Refunds are available within 30 days if …"

[Source: doc-23 | “Exceptions” | Section: Digital goods | Updated: 2024-11-02]
"Digital purchases are non-refundable unless …"

This makes it easy to:

cite sources in the final answer,
enforce “no citation → no claim,”
and debug retrieval issues.

Tool use: the difference between “agent” and “chatbot”

Tool use is where a lot of “agents” go sideways in production.

The safe pattern is:

the model proposes a tool call,
your system validates it (allowlist + schema + authz),
your system executes it,
the model receives the result,
the agent decides next steps.

A generic tool-call client (JSON‑RPC style)

This snippet shows a minimal pattern for a tool server with session headers and timeouts.

import os
import uuid

import httpx

TOOL_SERVER_URL = os.environ["TOOL_SERVER_URL"]


def call_tool_server(method: str, params: dict | None = None, session_id: str | None = None) -> tuple[dict, dict]:
    headers = {
        "Content-Type": "application/json",
        "Tool-Protocol-Version": "2024-01-01",
    }
    if session_id:
        headers["Tool-Session-Id"] = session_id

    body = {
        "jsonrpc": "2.0",
        "id": str(uuid.uuid4()),
        "method": method,
        "params": params or {},
    }

    resp = httpx.post(TOOL_SERVER_URL, json=body, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json(), dict(resp.headers)


def initialize_tool_session() -> str | None:
    _, headers = call_tool_server("initialize")
    return headers.get("Tool-Session-Id")


def tool_call(name: str, arguments: dict, session_id: str) -> dict:
    result, _ = call_tool_server(
        "tools/call",
        params={"name": name, "arguments": arguments},
        session_id=session_id,
    )
    return result

This is not “agent logic”—it’s infrastructure. Keep it boring.

Tool allowlists and schemas

Before executing a tool call, validate:

tool name is in an allowlist,
arguments conform to a schema,
the user is authorized for the action,
budgets (max calls / max latency) aren’t exceeded.

That validation should happen outside the model.

Memory: short-term chat vs. long-term summaries

A common mistake is to keep appending the full conversation forever.

That creates:

token bloat,
privacy risk,
and “the model latched onto something from 40 turns ago.”

A more robust approach:

short-term memory: last N turns (recent, high-fidelity)
long-term memory: a periodically updated summary (compact, durable)

A generic long-term summary write/read pattern

The snippet below demonstrates a safe pattern:

store a summary keyed by user_id + session_id,
update it after each response,
read it at session start to prime the agent.

from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class AgentSummaryRecord:
    user_id: str
    session_id: str
    updated_at: str
    summary: str


class KeyValueStore:
    def put(self, key: dict[str, str], item: dict[str, Any]) -> None: ...
    def get(self, key: dict[str, str]) -> dict[str, Any] | None: ...


def write_agent_summary(store: KeyValueStore, user_id: str, session_id: str, summary: str, updated_at: str) -> None:
    record = AgentSummaryRecord(
        user_id=user_id,
        session_id=session_id,
        updated_at=updated_at,
        summary=summary,
    )
    store.put({"user_id": user_id, "session_id": session_id}, asdict(record))


def load_agent_summary(store: KeyValueStore, user_id: str, session_id: str) -> str | None:
    item = store.get({"user_id": user_id, "session_id": session_id})
    if not item:
        return None
    return str(item.get("summary") or "")

What belongs in the summary?

A good long-term summary is not a transcript. It’s:

user preferences (explicit),
stable facts the user confirmed,
open tasks,
and important constraints.

Avoid storing:

secrets,
raw documents,
PII that doesn’t need to persist.

Guardrails: prompt injection, data leaks, and safe tool calls

If your agent reads documents from the outside world (PDFs, web pages, tickets), assume those documents can contain hostile instructions.

Treat retrieved content as untrusted input

A simple, effective policy:

retrieved text may contain facts,
but it may not issue instructions,
and it may not override system rules.

Apply input/output guardrails as a service

Many orgs implement “guardrails” as a separate layer that:

screens user inputs,
screens model outputs,
optionally redacts/blocks content,
returns structured metadata (“intervened”, category, severity).

Here is a generic wrapper pattern:

import json
from typing import Any


class GuardrailsClient:
    def apply(self, *, content: str, source: str) -> dict[str, Any]:
        """source is typically 'INPUT' or 'OUTPUT'."""
        raise NotImplementedError


def apply_guardrails(guardrails: GuardrailsClient, payload: str | dict[str, Any], source: str) -> tuple[str | dict[str, Any], dict[str, Any]]:
    is_structured = isinstance(payload, dict)
    text = json.dumps(payload) if is_structured else payload

    resp = guardrails.apply(content=text, source=source)

    # Generic interpretation of a guardrails response
    action = str(resp.get("action", "NONE")).upper()
    filtered = resp.get("filtered_content", text)
    intervened = action in {"BLOCK", "INTERVENED"}

    resp["intervened"] = intervened

    if is_structured:
        try:
            return json.loads(filtered), resp
        except Exception:
            return {"raw_output": filtered}, resp

    return filtered, resp

Two practical tips:

If guardrails intervene, return a safe, deterministic response (don’t ask the LLM to “explain the policy violation”).
Run guardrails on tool outputs too if they can contain sensitive data.

Tool safety is guardrails + authorization

Guardrails can help with content risk, but tool safety requires:

server-side authorization,
immutable audit logs,
strict budgets.

Never rely on the model to “do the right thing.”

Verification: how you earn user trust

RAG agents gain adoption when users can verify.

Enforce “no citation → no claim”

A strong system rule:

If the agent can’t cite a source for a statement, it must label it as uncertainty or ask a follow-up.

Quote-first answering

A practical approach:

extract supporting quotes from retrieved sources,
write the answer in your own words,
attach citations.

This reduces hallucinations because the model is anchored to evidence.

Structured outputs for actions

When tools are involved, do not bury results inside prose.

Use an explicit response contract:

{
  "answer": "...",
  "citations": [
    {"source_id": "doc-17", "title": "Refund Policy", "section": "Eligibility", "quote": "..."}
  ],
  "actions": [
    {"type": "create_ticket", "status": "success", "ticket_id": "INC-456"}
  ],
  "confidence": "medium",
  "follow_ups": ["What was the purchase date?"]
}

That contract makes downstream UX and testing much easier.

Evaluation + observability

If you can’t measure it, you’ll end up debating prompts.

What to log (minimum viable traces)

For each request, capture:

rewritten search query (or queries),
retrieval results (source IDs + scores),
reranking results,
final context length,
tool calls (name + args hash + latency + status),
guardrails action metadata,
citations returned.

This is how you answer: “Why did it say that?”

What to measure (starter metrics)

Citation coverage: % of answers with ≥1 citation
Groundedness: evaluator score or “supported claims ratio”
Retrieval precision: are top citations actually relevant?
Escalation rate: how often the agent says “I don’t know” or hands off
Tool failure rate: how often tool calls fail/time out
Latency: p50/p95 end-to-end and retrieval/tool breakdown

Offline evaluation set

Build a small eval dataset (even 50–200 questions) with:

expected source documents,
disallowed sources,
expected follow-up questions,
red-team prompts for injection.

Iterate retrieval first, then prompting.

A shipping checklist

If you want a pragmatic sequence that reduces risk:

Ship RAG with citations (even if answers are short)
Add hybrid retrieval + metadata filtering
Add reranking if top‑k is noisy
Add a context builder (dedupe + span extraction)
Add guardrails (input + output)
Add tool runner (allowlist + schema + authz)
Add a tight agent loop (max 2–3 iterations)
Add verification (no citation → no claim)
Add tracing + offline evals

This order helps you avoid “agent chaos” before your foundations are stable.

Closing thoughts

A RAG agent is best thought of as a retrieval system with an LLM interface—not the other way around.

If you invest in retrieval quality, context building, tool safety, and verification, you get a system users trust.

If you skip those and jump straight to “agent prompts,” you get a system that demos well and pages you at 2am.

About the Author:

Written by Suraj Khaitan

DEV Community