Retrieval-Augmented Generation (RAG) has fundamentally transformed how AI systems access and reason over external knowledge. Instead of relying purely on what a model learned during training, RAG allows the model to retrieve fresh, relevant documents at query time, grounding its responses in real, up-to-date data.
However, as real-world use cases grow more complex, the traditional single-agent RAG architecture begins to show limitations. What happens when your knowledge exists across multiple sources? Product documentation, historical support tickets, and live web data each require distinct retrieval strategies. A single retriever attempting to handle all of them either misses critical context or overwhelms the LLM with irrelevant noise.
Multi-Agent RAG addresses this challenge. Instead of one agent handling everything, you build a coordinated system: specialised agents that own individual knowledge sources, a routing agent that decides which agents to activate, and a synthesis agent that composes the final grounded answer. In this post, we will walk through how to build this architecture using LangChain.
Use Case
Imagine you are developing a support chatbot for a SaaS product. Users might ask:
“How do I configure OAuth in your API?”
“Was the login bug from last month ever resolved?”
“What are the latest changes in the v3.0 release?”
Each question requires access to a different knowledge source. The first depends on product documentation. The second relies on support ticket history. The third may require recent release notes or even live web updates.
A single RAG agent would attempt to blend all sources into one retrieval step, often producing diluted or confused answers.
Multi-Agent RAG assigns each knowledge source to a dedicated retrieval agent. A router interprets the user’s intent and activates only the relevant agents. The result is faster, more precise, and significantly more scalable.
Multi-Agent RAG Architecture
Before writing code, it is important to understand the complete system flow. The diagram below illustrates how a user query moves through the architecture to produce a final answer.
Figure 1: Multi-Agent RAG — end-to-end data flow
The system consists of five major stages:
- Query routing — a router agent decides which knowledge sources are relevant
- Specialised retrieval agents — one per knowledge source, each with its own vector store
- Parallel retrieval — the selected agents run concurrently
- Context aggregation — results are merged and deduplicated
- Synthesis — a final agent composes the grounded answer
Setting Up the Agents
The foundation of the system is straightforward: each knowledge source gets its own vector store, retriever, and tightly scoped system prompt. The narrower the scope, the higher the retrieval precision.
Here is how the shared infrastructure is initialized:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.tools.retriever import create_retriever_tool
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Separate vector stores per knowledge source
docs_vs = FAISS.from_documents(docs_documents, embeddings)
tickets_vs = FAISS.from_documents(ticket_documents, embeddings)
Why separate vector stores?
Combining all documents into one store forces the retriever to score similarity across unrelated domains. Isolating stores ensures cleaner similarity matching and reduces cross-domain noise.
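The snippets above assume `docs_documents` and `ticket_documents` already exist; producing them is a loading-and-chunking step the post does not show. As a minimal, hypothetical sketch, a fixed-size splitter with overlap (standing in for a real text splitter such as LangChain's `RecursiveCharacterTextSplitter`) looks like this:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split raw text into overlapping chunks ready for embedding."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

In practice you would wrap each chunk in a LangChain `Document` before passing it to `FAISS.from_documents`; the overlap preserves context that would otherwise be cut at chunk boundaries.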
The agent factory function below can be reused for each knowledge source. Notice that the description parameter plays a crucial role — it informs the router when this agent should be invoked.
def build_agent(vectorstore, name, description):
    tool = create_retriever_tool(
        vectorstore.as_retriever(search_kwargs={"k": 5}),
        name=name,
        description=description,
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", f"You are a retrieval agent for {name}. Be precise and concise."),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ])
    agent = create_openai_functions_agent(llm, [tool], prompt)
    return AgentExecutor(agent=agent, tools=[tool])
docs_agent = build_agent(docs_vs, "docs_retriever", "Product documentation and API guides")
tickets_agent = build_agent(tickets_vs, "tickets_retriever", "Customer support ticket history")
The Router Agent
The router is the decision-making core of the system. It analyzes the incoming query and determines which retrieval agents to activate.
The key design decision here is structured JSON output. This ensures routing decisions are transparent, deterministic, and easy to debug.
Setting temperature=0 for the router is essential. Routing requires consistency, not creativity.
ROUTER_PROMPT = """
Route the query to the correct agents. Return valid JSON only:
{"agents": [...], "reasoning": "..."}
Agents available:
- docs_retriever → technical documentation, API references, how-to guides
- tickets_retriever → support tickets, bug reports, issue resolutions
Example 1: "How do I reset my API key?"
{"agents": ["docs_retriever"], "reasoning": "API key management is covered in documentation"}
Example 2: "Was the 2FA bug from March resolved?"
{"agents": ["tickets_retriever", "docs_retriever"],
"reasoning": "Ticket history provides context; documentation confirms the fix"}
"""
The reasoning field is not decorative — log it in production. It becomes invaluable when debugging routing decisions.
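The `route_query` helper used later in the pipeline is not shown in the post. However it calls the LLM, the defensive half of it can be sketched on its own: parse the router's reply, validate agent names, and fall back to querying every agent rather than failing the request. This is a hypothetical sketch, not the author's implementation:

```python
import json

# Known agent names; must match the router prompt's agent list.
KNOWN_AGENTS = {"docs_retriever", "tickets_retriever"}

def parse_routing(raw: str) -> dict:
    """Parse the router's JSON reply; fall back to all agents on bad output."""
    try:
        # Tolerate prose around the JSON by slicing from first { to last }.
        start, end = raw.index("{"), raw.rindex("}") + 1
        decision = json.loads(raw[start:end])
        agents = [a for a in decision.get("agents", []) if a in KNOWN_AGENTS]
        if agents:
            return {"agents": agents, "reasoning": decision.get("reasoning", "")}
    except (ValueError, json.JSONDecodeError):
        pass
    # Fallback: query every agent rather than fail the request.
    return {"agents": sorted(KNOWN_AGENTS),
            "reasoning": "fallback: unparseable routing output"}
```

A full `route_query` would then be roughly `parse_routing(llm.invoke(ROUTER_PROMPT + query).content)`, with the fallback guaranteeing the pipeline still answers even when the router misbehaves.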
Parallel Retrieval and Context Aggregation
After routing, selected agents execute in parallel using asyncio. This is one of the most significant advantages of multi-agent RAG: latency is determined by the slowest agent, not the sum of all agents.
Once retrieval completes, context aggregation removes duplicate content. Duplicate passages waste context window space and may distort synthesis.
import asyncio

# Map router output names to the agents built earlier.
AGENT_MAP = {"docs_retriever": docs_agent, "tickets_retriever": tickets_agent}

async def retrieve_parallel(query, agent_names):
    tasks = [AGENT_MAP[n].ainvoke({"input": query})
             for n in agent_names if n in AGENT_MAP]
    results = await asyncio.gather(*tasks)
    seen, unique = set(), []
    for r in results:
        h = hash(r["output"])
        if h not in seen:
            seen.add(h)
            unique.append(r["output"])
    return unique
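Hashing the raw output only removes exact duplicates; passages that differ in whitespace or casing slip through. A slightly more forgiving key (a sketch, normalising before hashing) catches those cases:

```python
def dedup_key(text: str) -> int:
    """Hash a normalised form so trivially different passages collapse."""
    return hash(" ".join(text.lower().split()))
```

Swapping `hash(r["output"])` for `dedup_key(r["output"])` in `retrieve_parallel` is enough; for true near-duplicate detection you would need embedding similarity instead.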
Synthesis and Final Answer Generation
The synthesis agent receives the deduplicated context and produces the final grounded response.
Prompt discipline is critical here. The model must remain strictly anchored to retrieved context.
async def answer_query(query):
    routing = route_query(query)
    contexts = await retrieve_parallel(query, routing["agents"])
    combined = "\n\n---\n\n".join(contexts)
    prompt = f"""
Answer using ONLY the context provided below.
If the context is insufficient, say: "I don't have enough information."

Context:
{combined}

Question: {query}
"""
    return (await llm.ainvoke(prompt)).content
The phrase “ONLY the context provided below” significantly reduces hallucination by discouraging the model from falling back on its internal training knowledge when the retrieved context is thin.
Prompt Engineering Strategies
Prompt quality has the highest leverage across the system.
Few-Shot Prompting
Providing 2–3 routing examples dramatically improves classification accuracy.
Structured Output
Enforcing JSON ensures integration reliability and supports automated validation.
Context Anchoring
Explicitly instructing the model to rely only on retrieved context improves factual consistency.
Evaluation and Optimization
Deployment without evaluation is risky. You should measure:
- Routing accuracy
- Context precision
- Answer faithfulness
Re-run evaluations after prompt changes, not just code updates. Small prompt tweaks can shift routing accuracy significantly.
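Routing accuracy is the easiest of the three metrics to automate: compare the router's chosen agent set against a hand-labelled expected set per query. A minimal exact-match version (a sketch; stricter than partial-credit scoring) looks like this:

```python
def routing_accuracy(predictions: list[set], labels: list[set]) -> float:
    """Fraction of queries where the router picked exactly the expected agent set."""
    if not labels:
        return 0.0
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)
```

Run this over a fixed evaluation set of labelled queries after every prompt change; a drop of even a few points usually traces back to a wording tweak in the router prompt's examples.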
Future Improvements
- Agent memory for multi-turn continuity
- Self-correcting retrieval loops
- Dynamic agent creation for new knowledge sources
- Hierarchical routing layers
- Cost-aware routing strategies
Conclusion
Multi-Agent RAG is not about unnecessary complexity. It is about giving each knowledge source the specialisation it deserves.
- Specialisation improves retrieval precision
- Measure before optimising
- Parallelism minimises latency overhead
- Router prompt quality defines system reliability
- Start simple and scale intentionally
Begin with two agents. Measure routing performance. Iterate deliberately.