Last year I was building AI features for a high-traffic editorial platform when I ran into a problem that kept showing up: users needed answers that lived in two places at once. Half the context was buried in internal documents — contracts, policies, style guides — and the other half was out on the open web, changing by the hour. No single retrieval strategy could handle both.
So I built a system where multiple AI agents collaborate to figure out where to look, go find the information, and synthesize it into a single coherent answer. The result is multi-agent-researcher, an open-source CLI tool powered by LangGraph, FAISS, and Ollama.
This post walks through the architecture decisions, the trade-offs I encountered, and the patterns I'd reuse in any multi-agent system.
## The Problem: One Question, Multiple Knowledge Sources
Consider these two questions:
- "What does article 3 of the contract say about termination?"
- "What are the latest regulations on remote work in Italy?"
The first one lives in a local document. The second one requires a web search. Simple enough — but what about this one?
"What does our company policy say about remote work, and how does it compare to current labor laws?"
That question needs both. And the system should figure that out on its own, without the user specifying which source to query.
This is the core design challenge: intelligent routing. The system needs to classify the intent behind a question and dispatch it to the right agents — before any retrieval happens.
## Architecture: A Graph, Not a Pipeline
My first instinct was a simple if/else chain: check for keywords, route accordingly. That broke down fast. Questions are ambiguous. "What's the policy on X?" could mean an internal document or a government regulation, depending on context.
Instead, I modeled the system as a directed graph using LangGraph, where each node is an autonomous agent with a single responsibility:
```
             user question
                   │
                   ▼
             ┌────────────┐
             │ Orchestrate│ ← LLM classifies the question
             └─────┬──────┘
                   │
                   │ conditional fan-out
      ┌────────────┴────────────┐
      ▼                         ▼
┌───────────┐            ┌──────────────┐
│ RAG Agent │            │  Web Agent   │
│  (FAISS)  │            │ (Tavily/DDG) │
└─────┬─────┘            └──────┬───────┘
      │                         │
      └────────────┬────────────┘
                   ▼
            ┌────────────┐
            │ Synthesize │ ← combines all context
            └────────────┘
```
Four nodes. One conditional edge. That's the entire system. LangGraph handles the rest — parallel execution, state synchronization, fan-in after fan-out.
Why a graph instead of a chain? Because chains are sequential by definition. When you need both RAG and web search, a chain forces you to pick an order. A graph lets both agents run simultaneously and converge when they're done.
## The Orchestrator: Teaching an LLM to Route
The orchestrator is the brain of the system. It receives the user's question and classifies it into one of four routing strategies:
| Route | When | Agents activated |
|---|---|---|
| RAG | Question targets private/indexed documents | RAG only |
| WEB | Question needs fresh or general information | Web only |
| BOTH | Question spans local and external knowledge | RAG + Web in parallel |
| NONE | Trivial or conversational — no retrieval needed | Straight to synthesis |
The classification prompt is deliberately minimal. I ask the LLM to return exactly one word. No JSON parsing, no structured output format, no schema validation — just a single token that maps directly to a routing decision. This keeps latency low and reduces failure modes.
The critical insight here is that the orchestrator doesn't need to be right every time — it needs to be right enough. If it routes a RAG question to BOTH, the web results just get ignored during synthesis. If it routes a web question to RAG, the empty results trigger a graceful fallback. The system is designed to be robust to misclassification.
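Put together, the router can be sketched like this. The prompt wording and the `classify` helper are illustrative, not the repo's actual code; `llm` stands in for any callable that sends a prompt to Ollama and returns text:

```python
ROUTES = {"RAG", "WEB", "BOTH", "NONE"}

# Illustrative prompt -- the real project's wording may differ.
CLASSIFY_PROMPT = """You are a router. Reply with exactly one word:
RAG  - the question targets private/indexed documents
WEB  - the question needs fresh or general web information
BOTH - the question needs local documents AND the web
NONE - conversational, no retrieval needed

Question: {question}
Answer:"""

def classify(question: str, llm) -> str:
    """llm is any callable str -> str (e.g. a thin Ollama wrapper)."""
    raw = llm(CLASSIFY_PROMPT.format(question=question)).strip().upper()
    # Misclassification tolerance: anything unexpected falls back to BOTH,
    # since over-retrieving is cheap and synthesis ignores unused context.
    return raw if raw in ROUTES else "BOTH"
```

The whole failure-handling story is that last line: a malformed response degrades to the safest route instead of raising.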
## Shared State: The Glue Between Agents
LangGraph uses a TypedDict as shared state that flows through the graph. Here's what mine looks like:
```python
from operator import add
from typing import Annotated, Optional, TypedDict

class ResearchState(TypedDict):
    question: str
    needs_rag: bool
    needs_web: bool
    rag_results: Annotated[list[str], add]
    web_results: Annotated[list[str], add]
    output_mode: str
    final_answer: Optional[str]
```
The Annotated[list[str], add] pattern is doing heavy lifting here. That add reducer tells LangGraph: when multiple nodes write to this field concurrently, concatenate the lists instead of overwriting. Without it, parallel fan-out would be a race condition — whichever agent finishes last would clobber the other's results.
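A toy illustration of those merge semantics, in plain Python with no LangGraph (the chunk strings are made up):

```python
from operator import add  # the reducer named in Annotated[list[str], add]

rag_write = ["clause 3.2: 30-day notice"]  # written by the RAG branch
web_write = ["2024 labor-law summary"]     # written by the web branch

# With the reducer, LangGraph merges concurrent writes by concatenation:
merged = add(rag_write, web_write)

# Without a reducer, last-write-wins would silently drop one branch:
clobbered = web_write
```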
This is one of those details that seems minor but is actually the foundation of the whole parallel execution model. Getting state management wrong in multi-agent systems is the number-one source of subtle bugs.
## The RAG Agent: Local Knowledge via FAISS
The RAG agent handles document retrieval using a FAISS vector store with embeddings generated locally via Ollama (no external API calls for embeddings — everything stays private).
The indexing pipeline supports PDF, Markdown, plain text, and DOCX files. Documents get chunked with configurable size and overlap (defaults: 1000 characters, 200 overlap), embedded, and stored in a named collection. The collection system lets you segment different knowledge domains — one for contracts, another for policies, a third for archived web results.
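The chunking step, stripped to its essence, looks something like this — a sketch of fixed-size overlapping windows, not the project's exact splitter:

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; consecutive chunks share
    `overlap` characters so sentences at boundaries aren't lost."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 2,500-character document becomes three chunks, and each chunk repeats the last 200 characters of the previous one.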
That last part is worth highlighting: the system can re-index its own web search results into the RAG store. This means frequently-asked questions about external topics gradually become locally cached knowledge. It's a simple feedback loop, but it dramatically reduces redundant web searches over time.
## The Web Agent: Dual-Source with Graceful Fallback
Web search uses Tavily as the primary source and DuckDuckGo as a zero-config fallback. If Tavily's API key isn't set, the system silently switches to DuckDuckGo — no configuration required, no error messages, it just works.
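The fallback is roughly this shape. The search backends are passed in as callables here for illustration; the real code talks to the Tavily and DuckDuckGo clients directly:

```python
import os

def web_search(query: str, tavily_fn, ddg_fn) -> list[str]:
    """Prefer Tavily when a key is configured; degrade silently to DuckDuckGo."""
    if os.environ.get("TAVILY_API_KEY"):
        try:
            return tavily_fn(query)
        except Exception:
            pass  # quota/network/auth failure: fall through, don't crash
    return ddg_fn(query)
```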
Raw web results are saved as timestamped Markdown files before being passed to the synthesizer. This serves two purposes: auditability (you can trace every answer back to its sources) and the re-indexing loop mentioned above.
## The Synthesizer: From Fragments to Answers
The synthesize node receives all accumulated context — RAG chunks, web results, or both — and generates a final answer. The prompt instructs the LLM to weave multiple sources into a coherent response rather than just listing them.
This is where having a well-structured state pays off. The synthesizer doesn't need to know how the context was retrieved. It just sees an array of text chunks and composes a response. This decoupling means adding a new agent (database, API, whatever) requires zero changes to the synthesis logic.
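That decoupling looks roughly like this — the prompt wording is a placeholder, and `llm` is again any string-in, string-out callable:

```python
def synthesize(state: dict, llm) -> dict:
    """Compose a final answer from whatever context the branches produced."""
    context = state.get("rag_results", []) + state.get("web_results", [])
    if not context:
        # NONE route: no retrieval happened, answer conversationally
        prompt = state["question"]
    else:
        prompt = (
            "Answer the question by weaving the sources below into one "
            "coherent response; do not just list them.\n\n"
            + "\n---\n".join(context)
            + f"\n\nQuestion: {state['question']}"
        )
    return {"final_answer": llm(prompt)}
```

Note that nothing here mentions FAISS or Tavily: a new retrieval agent only has to append to one of the two result lists.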
## Conditional Fan-Out: The Routing Function
The routing logic is a single function that returns a list of destination nodes:
```python
def _route(state) -> list[str]:
    destinations = []
    if state["needs_rag"]:
        destinations.append("rag_agent")
    if state["needs_web"]:
        destinations.append("web_agent")
    return destinations or ["synthesize"]
```
Returning a list triggers LangGraph's parallel fan-out. Both agents execute concurrently, write their results to the shared state (using the add reducer), and LangGraph automatically waits for all branches to complete before executing synthesize. No manual synchronization, no callbacks, no promises to resolve.
The or ["synthesize"] fallback handles the NONE route — when the orchestrator decides no retrieval is needed, the question goes straight to synthesis.
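To make the fan-out concrete, here is a stdlib-only toy that mimics what LangGraph does with that list of destinations: run each one concurrently, then merge the partial updates with list-concatenation semantics. The agent bodies are stubs:

```python
from concurrent.futures import ThreadPoolExecutor

def rag_agent(state):  # stub standing in for FAISS retrieval
    return {"rag_results": ["local chunk"]}

def web_agent(state):  # stub standing in for Tavily/DuckDuckGo search
    return {"web_results": ["web snippet"]}

AGENTS = {"rag_agent": rag_agent, "web_agent": web_agent}

def fan_out(state, destinations):
    # fan-out: each destination runs in its own thread
    with ThreadPoolExecutor() as pool:
        updates = list(pool.map(lambda name: AGENTS[name](state), destinations))
    # fan-in: apply `add`-reducer semantics to every field each branch wrote
    for update in updates:
        for key, value in update.items():
            state[key] = state.get(key, []) + value
    return state
```

LangGraph does all of this (plus error handling and checkpointing) for you — which is exactly why the production graph stays at four nodes.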
## Running Locally: No API Keys Required
A deliberate design choice: the entire system runs locally using Ollama. No OpenAI key, no cloud dependencies, no data leaving your machine. This matters for enterprise use cases where documents contain sensitive information.
```bash
ollama pull llama3.2          # reasoning
ollama pull nomic-embed-text  # embeddings

python main.py index ./docs/
python main.py research "What does the contract say about termination?"
```
Tavily is optional (adds higher-quality web search), but the system works fine with just DuckDuckGo out of the box.
## What I'd Do Differently
Better evaluation. I built this system iteratively, testing routing accuracy by feel. In production, I'd want a labeled dataset of questions with expected routes, and automated evaluation of retrieval quality (recall, precision, relevance scores).
Streaming responses. The current version waits for complete synthesis before outputting anything. For longer answers, streaming the synthesizer's output token-by-token would dramatically improve perceived performance.
Agent memory. Right now each question is independent. Adding conversation memory would let the system handle follow-ups: "Tell me more about article 3" after a previous question about the contract.
Dynamic chunk sizing. Fixed-size chunking works, but it's a blunt instrument. For structured documents (contracts with numbered clauses, policies with sections), semantic chunking based on document structure would improve retrieval precision.
## Patterns Worth Stealing
If you're building multi-agent systems, here are the patterns from this project that I think generalize well:
- LLM-as-router. Use a lightweight LLM call to classify intent before doing any expensive retrieval. It's faster and more flexible than rule-based routing.
- Reducer-based state. LangGraph's Annotated reducers solve parallel state conflicts elegantly. Define how concurrent writes merge, and forget about synchronization.
- Graceful degradation. Every external dependency has a fallback. Tavily fails → DuckDuckGo. No indexed documents → skip RAG. The system should always return something.
- Self-feeding loops. Save intermediate results (web searches, synthesized answers) in a format that can be re-indexed. Your system gets smarter over time without any additional training.
- Single-responsibility nodes. Each agent does one thing. This makes the system trivially extensible — adding a database agent is three lines of code plus the query logic.
The full source code is on GitHub. Questions, feedback, or ideas for new agents? I'd love to hear about them.