Building a Multi-Agent Clinical Research System with LangGraph, DeepSeek and Tavily
Imagine asking a complex medical question and having a network of AI agents research the web, critique the quality of their own findings, and only write the final report after you personally approve the evidence. That is exactly what I built.
In this article I'll walk through the architecture, the key decisions, and the code that makes it work.
The Problem
Clinicians, researchers, and health operators spend hours reviewing scientific literature before making decisions. The question was: can a multi-agent AI system do this reliably, with quality control built in?
The Architecture
The system runs a LangGraph StateGraph with 4 specialized agents:
- Orchestrator: initializes state and coordinates the flow
- Researcher: searches the web via Tavily and a mock PubMed tool (up to 3 rounds)
- Critic: evaluates data quality and loops back to the Researcher if the evidence is insufficient
- Writer: generates the final structured clinical report
Why LangGraph over CrewAI?
LangGraph gives explicit control over graph edges, state transitions, and interrupt points. In a clinical context, predictability matters more than convenience. CrewAI hides more of the control flow behind higher-level abstractions, whereas LangGraph lets you see exactly what runs, when, and why.
The State
The most important design decision in the state is the operator.add annotation on research_data. By default, LangGraph overwrites a state field with whatever value a node returns, so without the annotation each research round would erase the previous one. With it, LangGraph appends each new list to the existing one, building cumulative context across revision cycles.
from typing import TypedDict, List, Dict, Any, Annotated
import operator

class AgentState(TypedDict):
    query: str
    research_data: Annotated[List[Dict[str, Any]], operator.add]
    critic_feedback: str
    revision_count: int
    final_report: str
    is_sufficient: bool
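To make the overwrite-versus-append difference concrete, here is a minimal stdlib-only sketch. The `apply_update` helper is hypothetical: it only imitates the merge behavior described above and is not LangGraph's implementation. The state is trimmed to three fields for brevity.

```python
import operator
from typing import Annotated, Any, Dict, List, TypedDict, get_type_hints


class AgentState(TypedDict):
    query: str
    research_data: Annotated[List[Dict[str, Any]], operator.add]
    revision_count: int


def apply_update(state: dict, update: dict) -> dict:
    """Hypothetical helper: merge a node's partial update into the state.

    Fields annotated with a reducer are combined as reducer(old, new);
    every other field is simply overwritten.
    """
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        reducer = metadata[0] if metadata else None
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"query": "semaglutide in CKD", "research_data": [], "revision_count": 0}
state = apply_update(
    state, {"research_data": [{"source": "tavily", "round": 1}], "revision_count": 1}
)
state = apply_update(
    state, {"research_data": [{"source": "pubmed_mock", "round": 2}], "revision_count": 2}
)

print(len(state["research_data"]))  # both rounds are kept
print(state["revision_count"])      # plain field: last write wins
```

The practical consequence is that a node should return only the new items for research_data, not the full list, or the earlier rounds would be duplicated.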
The Critic Agent — The Key Design Decision
The Critic is what separates this system from a simple RAG pipeline. It receives the research data and asks: Is the evidence recent enough? Are there contradictions between sources? Is the population studied the right one?
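import json
from langchain_core.messages import HumanMessage

# llm_deepseek is the chat model instance configured elsewhere in the project.
async def critic_node(state: AgentState):
    # Ask the model for a JSON verdict on the evidence collected so far.
    prompt = f"""You are a Medical Critic.
Evaluate if this research is sufficient for the query.
Query: {state['query']}
Research Data: {state['research_data']}
Respond ONLY with valid JSON:
{{"is_sufficient": true, "feedback": "your feedback"}}"""
    # Use the async call so this node does not block the event loop.
    res = await llm_deepseek.ainvoke([HumanMessage(content=prompt)])
    parsed = json.loads(res.content.strip())
    # Default to "not sufficient" when the key is missing.
    return {
        "is_sufficient": parsed.get("is_sufficient", False),
        "critic_feedback": parsed.get("feedback", ""),
    }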
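A model asked to respond only with JSON may still wrap its answer in a code fence or add a sentence around it, in which case a bare json.loads raises and the node fails. One possible safeguard is a small parsing helper like the one below. The name `parse_critic_verdict` and the fallback message are assumptions for illustration, not part of the repository.

```python
import json
import re
from typing import Any, Dict


def parse_critic_verdict(raw: str) -> Dict[str, Any]:
    """Hypothetical helper: extract the Critic's JSON verdict from raw model text.

    Falls back to "not sufficient" when nothing parseable is found, so a
    malformed reply leads to another research round instead of a crash.
    """
    fallback = {
        "is_sufficient": False,
        "critic_feedback": "Critic reply was not valid JSON.",
    }
    # Take the span from the first "{" to the last "}", which skips
    # surrounding code fences and stray prose.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return fallback
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(parsed, dict):
        return fallback
    return {
        # Only a literal JSON true counts as approval.
        "is_sufficient": parsed.get("is_sufficient") is True,
        "critic_feedback": str(parsed.get("feedback", "")),
    }


fence = "`" * 3  # built at runtime so this example contains no literal fence
fenced_reply = (
    f'{fence}json\n'
    '{"is_sufficient": true, "feedback": "Covers the CKD subgroup."}\n'
    f'{fence}'
)
print(parse_critic_verdict(fenced_reply))
print(parse_critic_verdict("I could not evaluate this."))
```

Treating anything other than a literal true as "not sufficient" is a deliberately conservative choice: an ambiguous verdict costs one extra research round rather than letting weak evidence through.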
If the Critic returns is_sufficient: false, the graph sends the Researcher back for another round. The number of rounds is capped at 3 so the graph cannot loop forever.
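The routing rule can be sketched as a plain function over the state. This is an illustration under stated assumptions: the node names "researcher" and "critic", the constant `MAX_REVISIONS`, and the idea that revision_count holds the number of completed research rounds are not confirmed by the code shown here.

```python
from typing import Any, Dict

MAX_REVISIONS = 3  # cap taken from the article; the constant name is an assumption


def route_after_critic(state: Dict[str, Any]) -> str:
    """Return the name of the node that should run after the Critic."""
    if state.get("is_sufficient", False):
        return "writer"
    if state.get("revision_count", 0) >= MAX_REVISIONS:
        # Budget exhausted: stop looping and let the human reviewer decide.
        return "writer"
    return "researcher"


# In LangGraph, a function like this would typically be attached with
# workflow.add_conditional_edges("critic", route_after_critic).

print(route_after_critic({"is_sufficient": False, "revision_count": 1}))
print(route_after_critic({"is_sufficient": False, "revision_count": 3}))
```

Because the graph pauses before the Writer, hitting the cap still puts the evidence in front of a human rather than producing a report automatically.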
Human-in-the-Loop
This is the most critical safety feature. Before the Writer generates the final report, the graph pauses and presents the collected evidence to the user.
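from langgraph.checkpoint.memory import MemorySaver

clinical_app = workflow.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["writer"],  # pause before the Writer node runs
)

# After the first run pauses at the interrupt. config is the same dict,
# carrying the thread_id, that was used to start the run.
state = await clinical_app.aget_state(config)  # snapshot of the collected evidence
ans = input("Approve research and generate report? (y/n): ")
if ans == 'y':
    # Passing None as input resumes from the saved checkpoint.
    async for event in clinical_app.astream(None, config=config):
        if "writer" in event:
            report = event["writer"]["final_report"]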
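When the user types y, the graph resumes from the checkpoint saved by MemorySaver, so no data is lost between the pause and the resume. MemorySaver keeps checkpoints in process memory, so this holds only while the process stays alive.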
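A strict comparison with 'y' treats "Y", "yes", or an answer with trailing whitespace the same as a refusal. A small normalizing helper, sketched below, makes the three possible outcomes explicit. `ReviewDecision` and `parse_review_answer` are hypothetical names, not part of the repository.

```python
from enum import Enum


class ReviewDecision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    UNCLEAR = "unclear"


def parse_review_answer(answer: str) -> ReviewDecision:
    """Map the reviewer's raw terminal input to an explicit decision."""
    normalized = answer.strip().lower()
    if normalized in {"y", "yes"}:
        return ReviewDecision.APPROVE
    if normalized in {"n", "no"}:
        return ReviewDecision.REJECT
    # Anything else is ambiguous; the caller should ask again.
    return ReviewDecision.UNCLEAR


print(parse_review_answer(" Y "))
print(parse_review_answer("maybe"))
```

Only APPROVE should resume the graph. On UNCLEAR the prompt can simply be repeated, and on REJECT the run can end without ever invoking the Writer.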
Real Test Query
I ran the system with this query:
"Latest evidence on semaglutide for obesity treatment in CKD patients?"
The Critic agent referenced the 2024 FLOW trial in its evaluation — the AI didn't just search, it questioned the quality and specificity of what it found. The final report is available in the repository as clinical_report.md.
Terminal Output
🎯 Orchestrator: Starting pipeline...
🔬 Researcher: Research round 1...
🔧 Using tool: tavily_search
🔧 Using tool: pubmed_mock_search
🧐 Critic: Evaluating data quality...
✅ Sufficient: True
--- Executing: __interrupt__ ---
Approve research and generate report? (y/n): y
✍️ Writer: Generating clinical report...
✅ Report generated: clinical_report.md
Stack
- LangGraph — stateful multi-agent orchestration
- LangChain — LLM abstractions and tool calling
- DeepSeek — cost-efficient OpenAI-compatible LLM
- Tavily — AI-optimized real web search
- Pydantic — structured outputs
- Python 3.14
Full Repository
Complete code, architecture decisions, and the AI-generated clinical report example:
🧬 LangGraph Clinical Research Orchestrator
Multi-Agent AI system for clinical evidence surveillance.
Built with LangGraph, DeepSeek and Tavily.
How it works
You ask a complex medical question. A network of AI agents researches, critiques and only writes the report when the data quality is approved — by both the Critic agent and a human reviewer.
Agent Flow
graph TD
A([START]) --> B[🎯 Orchestrator]
B --> C[🔬 Researcher\nTavily + PubMed]
C --> D[🧐 Critic\nEvaluates quality]
D -->|Not sufficient| C
D -->|Approved| E{👤 Human Review\nHITL}
E -->|y| F[✍️ Writer\nMarkdown Report]
F --> G([END])
Agents
| Agent | Role |
|---|---|
| Orchestrator | Initializes state and coordinates the flow |
| Researcher | Searches web via Tavily + PubMed mock |
| Critic | Evaluates data quality — loops back if insufficient |
| Writer | Generates final structured clinical report |
Key Technical Decisions
- LangGraph over CrewAI: explicit control over edges, state and interrupts
- operator.add on research_data: append-only accumulation across revisions
- interrupt_before=["writer"]: human…
What's Next
- Real PubMed API integration (replace mock)
- FastAPI endpoint to serve queries via HTTP
- Weekly monitoring mode with alerts on new studies
- Multi-condition support — configurable per user
- Output variants — technical for doctors, simplified for patients
If you're building with LangGraph or multi-agent systems, I'd love to connect. Drop a comment below or find me on LinkedIn.