
Fede C

From 60% to 84%: Building an AI Agent for Public Health Data

The fixes that actually worked weren't about prompts.


We built SaludAI, an open-source AI agent that takes clinical questions in natural language and queries a FHIR R4 server. Think: "How many patients with type 2 diabetes over 60 are in Buenos Aires?" — the agent resolves the terminology, builds the FHIR queries, follows references across resource types, and returns a traceable answer.

No LangChain. The agent loop is ~300 lines of Python, every step logged to Langfuse. When something breaks, we can read exactly why.
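A hand-rolled loop like that can be sketched in a few lines. This is illustrative, not the actual SaludAI code: `llm_call` and the tool registry are hypothetical stand-ins for the real LLM client and FHIR tools.

```python
# Minimal ReAct-style agent loop (illustrative sketch, not the SaludAI source).
# llm_call is assumed to return either {"tool": ..., "args": ...} or {"answer": ...}.

def run_agent(question, llm_call, tools, max_steps=10):
    """Ask the LLM for the next action, execute the tool, feed the result back."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = llm_call(history)
        if "answer" in action:
            return action["answer"]
        result = tools[action["tool"]](**action["args"])
        history.append({"role": "tool", "content": str(result)})
    return None  # step budget exhausted
```

Because every turn passes through this one function, logging each `action` and `result` to Langfuse is a one-line addition, which is what makes failures readable.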

We benchmarked it: 100 questions, 200 synthetic Argentine patients, 10 FHIR resource types, 4 terminology systems. Inspired by FHIR-AgentBench (Verily/KAIST/MIT) — but on synthetic data, so the scores aren't directly comparable to their clinical dataset.

Here's how accuracy evolved, and what each fix taught us.

60% → 82%: The agent wasn't seeing the data

Our FHIR client returned only the first page. With 687 immunizations and 437 encounters across 200 patients, most counts came back wrong. The fix was auto-pagination (search_all()). Boring. Fixed 11 questions overnight.
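The core of the fix is just following `Bundle.link` entries with `relation = "next"` until they run out. A minimal sketch, with an injected `fetch` callable standing in for the real HTTP client (the actual SaludAI helper may differ):

```python
# Sketch of FHIR search auto-pagination. `fetch` is any callable that takes a
# URL and returns the parsed JSON bundle (e.g. a thin wrapper over requests.get).

def search_all(fetch, base_url, resource_type, params=""):
    """Yield every matching resource, following Bundle.link[rel="next"]."""
    url = f"{base_url}/{resource_type}?{params}" if params else f"{base_url}/{resource_type}"
    while url:
        bundle = fetch(url)
        for entry in bundle.get("entry", []):
            yield entry["resource"]
        # The next-page link already encodes the original query parameters.
        url = next((link["url"] for link in bundle.get("link", [])
                    if link.get("relation") == "next"), None)
```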

Before optimizing the AI, make sure it sees all the data.

82% → 94%: Give LLMs a calculator

Questions like "average age of patients with hypertension" forced the LLM to do arithmetic in its head. It's bad at this. We added execute_code — a sandboxed Python environment. The agent writes len([e for e in entries if ...]) instead of trying to count mentally.
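The shape of such a tool can be very small. The sketch below is a deliberately naive illustration of the idea — a real sandbox needs process isolation, timeouts, and resource limits; the allowed-builtins whitelist and the `result` convention are assumptions, not the SaludAI implementation:

```python
# Naive execute_code sketch: run agent-written Python over the fetched entries.
# Real sandboxing requires process isolation and timeouts; this only restricts builtins.

ALLOWED_BUILTINS = {"len": len, "sum": sum, "min": min, "max": max,
                    "sorted": sorted, "round": round, "range": range}

def execute_code(code, data):
    """Execute the snippet with `entries` bound to the data; return `result`."""
    namespace = {"__builtins__": ALLOWED_BUILTINS, "entries": data}
    exec(code, namespace)
    return namespace.get("result")
```

The agent then emits snippets like `result = len([e for e in entries if e["age"] > 60])` and the arithmetic happens in Python, not in the model's head.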

+12 points from one tool.

94% → 79% → 89%: Scaling broke everything

We went from 50 questions to 100, from 55 patients to 200. Accuracy dropped 15 points. The 94% was fragile.

The problem: a pure ReAct loop with no plan. Complex questions needing 3-4 resource traversals left the agent wandering. We added a query planner — lightweight FHIR knowledge graph, 11 query patterns, and action space reduction: instead of telling the agent "don't use search_fhir for counting," we remove the tool entirely. The agent can't misuse what it can't see.
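Action space reduction is mechanically simple: the planner classifies the question into a pattern, then hands the loop a filtered tool registry. The pattern names and tool names below are illustrative, not the actual 11 patterns:

```python
# Sketch of action-space reduction (pattern and tool names are illustrative).
# Rather than prompting "don't use search_fhir for counting", the planner
# gives the agent only the tools that fit the classified query pattern.

TOOLS_BY_PATTERN = {
    "count":     ["search_all", "execute_code"],
    "lookup":    ["search_fhir"],
    "traversal": ["search_fhir", "resolve_reference", "execute_code"],
}

def tools_for(pattern, all_tools):
    """Return the subset of tools the agent is allowed to see for this pattern."""
    allowed = set(TOOLS_BY_PATTERN.get(pattern, all_tools))
    return {name: fn for name, fn in all_tools.items() if name in allowed}
```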

5 LLMs, same infrastructure

| Model | Accuracy | Simple | Medium | Complex |
|---|---|---|---|---|
| Claude Sonnet 4.5 | 84% | 94% | 93% | 72% |
| Claude Haiku 4.5 | 77% | 100% | 80% | 65% |
| GPT-4o | 63% | 100% | 73% | 40% |
| Llama 3.3 70B | 48% | 94% | 63% | 16% |
| Qwen 3.5 9B | 25% | 50% | 29% | 12% |

Simple questions are a commodity — every model above 9B gets 94%+. The gap is in complex multi-hop reasoning. And the planner + tool design is what makes Sonnet hit 84% instead of the ~60% you'd get with a naive loop.

One surprise: schema flattening fixed GPT-4o (Simple: 53% → 100%) but broke Qwen (29% → 13%). Every change needs validation across models.

What stuck

Benchmark everything. Our first "88%" was on 25 easy questions. The honest baseline was 60%.

Analyze per question, not averages. "82% accuracy" tells you nothing. "11 failures are truncated data, 3 are arithmetic errors" tells you what to fix.

Tool design > prompt engineering. execute_code was worth +12pp. No prompt tweak comes close.

Try it

```shell
git clone https://github.com/saludai-labs/saludai.git
cd saludai && uv sync
docker compose up -d
# "How many patients have type 2 diabetes?"
uv run saludai query "¿Cuántos pacientes tienen diabetes tipo 2?"
```

Early-stage, Apache 2.0. Built for Argentina's health system, designed for LATAM.

GitHub · Issues
