Alechko

Building patient cohorts from messy medical data with Elasticsearch Agent Builder

Over 80% of clinical trials fail to meet enrollment deadlines (PMC). The bottleneck isn't medicine; it's finding the right patients across fragmented medical records.

This article describes Medical Cohort Agent, an AI system built with Elasticsearch Agent Builder that turns a researcher's natural language question into a normalized, queryable patient cohort — handling schema variance, OCR artifacts, and missing data across multiple healthcare facilities.

This agent doesn't answer questions — it creates artifacts.

The Problem: Schema Variance Is the Real Enemy

We work with one of Israel's largest HMOs — an organization serving millions of patients across hundreds of facilities. Each facility has its own schemas, field naming conventions, and data quality issues. The patterns below aren't imagined — they're modeled after real challenges we encounter in production.

| Concept | Hospital A | Hospital B | Lab Chain | Clinic Network |
|---|---|---|---|---|
| Age | `patient_age: 67` | `גיל: 67` (Hebrew "age") | `age: 67` | `age_group: "60-70"` |
| Date | `2024-03-15` | `15/03/2024` | `15.03.24` | `2024-03` |
| Smoking | `smoking_status: true` | (not tracked) | (not tracked) | `smoking: true` |
| Notes | `clinical_notes` | `סיכום_רפואי` (Hebrew "medical summary") | `notes` | `text` |

Add OCR artifacts from scanned documents — סוכדת instead of סוכרת (diabetes misspelled by scanner) — and exact keyword matching fails silently.

The bottleneck isn't data volume. It's the variance.

The Solution: Two-Layer Architecture

The key insight: separate judgment from execution.

Layer 1 — Agent (Judgment)

The Elasticsearch Agent Builder agent handles schema discovery and criteria planning:

  • Discovers all indices and their mappings via platform tools (list_indices, get_index_mapping, search)
  • Builds per-facility field maps (which field holds "age", which holds "conditions")
  • Plans structured criteria from the natural language question
  • Explains data gaps and caveats BEFORE execution — e.g. "Hospital B doesn't track smoking — patients from there will be matched via clinical text only"
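The agent's hand-off to Layer 2 can be pictured as a per-facility field map. The sketch below is illustrative; the names and exact structure are assumptions, not the repo's actual output format:

```python
# Hypothetical per-facility field map the agent might emit after schema
# discovery. Keys are canonical concepts; values are the facility-specific
# field names (None where the facility doesn't track the concept).
field_maps = {
    "hospital_a": {"age": "patient_age", "smoking": "smoking_status", "notes": "clinical_notes"},
    "hospital_b": {"age": "גיל", "smoking": None, "notes": "סיכום_רפואי"},
    "lab_chain": {"age": "age", "smoking": None, "notes": "notes"},
    "clinic_network": {"age": "age_group", "smoking": "smoking", "notes": "text"},
}

# A data-gap caveat like "Hospital B doesn't track smoking" falls out directly:
gaps = [facility for facility, m in field_maps.items() if m["smoking"] is None]
print(gaps)  # facilities that need text-only (probable) matching
```

This is also why adding a facility needs no code changes: the agent only has to produce one more entry in the map.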

Layer 2 — Elastic Workflow (Execution)

A deterministic workflow normalizes data — no LLM in the loop:

  • Strict pass: iterates over facilities via foreach, applies a single parameterized Painless script that normalizes using the agent's field maps
  • Semantic kNN pass: uses E5-large embeddings (1024-dim) to find patients whose clinical text matches the research question semantically — catches OCR artifacts, synonyms, and negation
  • Count + breakdown: reports totals per facility and confidence level
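The real workflow performs the strict pass with a single parameterized Painless script; as a rough Python sketch of the same normalization logic (field names and the format list are illustrative):

```python
from datetime import datetime

# Date formats seen across facilities, mapped to one ISO 8601 canonical form.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%y", "%Y-%m"]

def normalize(doc: dict, field_map: dict) -> dict:
    """Rename facility-specific fields to canonical concepts and coerce values."""
    out = {}
    for concept, source_field in field_map.items():
        if source_field and source_field in doc:
            out[concept] = doc[source_field]
    # Canonicalize dates, whichever source format they arrive in.
    if "date" in out:
        for fmt in DATE_FORMATS:
            try:
                out["date"] = datetime.strptime(str(out["date"]), fmt).strftime("%Y-%m-%d")
                break
            except ValueError:
                continue
    # Patient IDs arrive as string, int, or float; normalize to a plain string.
    if "patient_id" in out:
        out["patient_id"] = str(out["patient_id"]).removesuffix(".0")
    return out

print(normalize({"גיל": 67, "תאריך": "15/03/2024"}, {"age": "גיל", "date": "תאריך"}))
```

Because the script is parameterized by the field map rather than hard-coded per facility, the `foreach` loop over facilities stays generic.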

The output is a persistent cohort_<name> index — not a chat response.
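The semantic pass is a standard Elasticsearch kNN search against the embedding field, where the query vector is the E5-large embedding of the researcher's question. A hedged sketch of the request body (the field name `clinical_text_embedding` is an assumption, not the repo's actual mapping):

```python
# Sketch of the kNN portion of the semantic pass.
def knn_query(query_vector: list[float], k: int = 50) -> dict:
    assert len(query_vector) == 1024, "E5-large produces 1024-dim vectors"
    return {
        "knn": {
            "field": "clinical_text_embedding",  # assumed field name
            "query_vector": query_vector,
            "k": k,
            "num_candidates": k * 4,  # wider candidate pool improves recall
        },
        "_source": ["patient_id", "clinical_text"],
    }

body = knn_query([0.0] * 1024)
print(body["knn"]["k"], body["knn"]["num_candidates"])
```

Because matching happens in embedding space rather than on surface tokens, an OCR-corrupted סוכדת still lands near סוכרת (diabetes).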

Why Not Just a Chatbot?

A chatbot answers questions. This agent creates artifacts.

The cohort index persists after the conversation ends. Researchers can:

  • Query it with ES|QL for follow-up analysis
  • Filter by match_confidence: strict vs probable
  • Inspect per-match provenance (source_field_map, match_explanation, knn_score)
  • Share it with colleagues
  • Build on it without re-processing raw data
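A follow-up department breakdown against the cohort index might look like this in ES|QL (the index and field names are assumed for illustration):

```sql
FROM cohort_diabetes_smokers_60plus
| WHERE match_confidence == "probable"
| STATS patients = COUNT(*) BY department, facility
| SORT patients DESC
```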

The workflow is deterministic and auditable. Adding a new facility requires no code changes — only the agent discovering and mapping its schema.

Normalization Impact

Reproducible metrics from the sample corpus (10 indices, 4 facilities):

| Metric | Before | After |
|---|---|---|
| Field names per concept | 5+ variants | 1 unified |
| Date formats | 4 different | 1 canonical |
| Patient ID types | string, int, float | normalized string |
| OCR-corrupted clinical terms | invisible to search | captured via kNN |
| Structured/text contradictions | ~18% undetected | classified with confidence |

The metrics script is included in the repo: `python3 scripts/metrics.py --data-dir sample_data`

Air-Gapped Deployment

Healthcare data cannot leave the network. The entire stack runs on a single VM with no internet dependency:

  • LLM: Ollama + Llama 4 (local inference)
  • Embeddings: E5-large via Ollama (1024-dim vectors)
  • Stack: Elasticsearch 9.3 + Kibana (Agent Builder + Workflows)

The architecture is LLM-agnostic — swap Ollama for any OpenAI-compatible provider without code changes.

Synthetic Data: e2llm-medsynth

Real patient data can't be used for demos. We built the data generator as a standalone open-source tool:

```bash
pip install e2llm-medsynth
e2llm-medsynth --verbose --output-dir output
```

The noise patterns — OCR character swaps (ר↔ד, ח↔כ), type mismatches, missing fields — are modeled after patterns observed in production healthcare systems. MedSynth makes them reproducible for anyone.
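As a minimal sketch, the Hebrew OCR character-swap noise can be reproduced like this (a simplified illustration, not MedSynth's actual implementation):

```python
import random

# Visually similar Hebrew letter pairs that scanners commonly confuse.
OCR_SWAPS = {"ר": "ד", "ד": "ר", "ח": "כ", "כ": "ח"}

def add_ocr_noise(text: str, rate: float, seed: int = 0) -> str:
    """Swap confusable letters with the given probability, seeded for reproducibility."""
    rng = random.Random(seed)
    return "".join(
        OCR_SWAPS[ch] if ch in OCR_SWAPS and rng.random() < rate else ch
        for ch in text
    )

print(add_ocr_noise("סוכרת", rate=1.0))  # every swappable letter flips
```

Seeding the RNG is what makes the corruption reproducible across runs, so the same corpus can be regenerated for benchmarking.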

Supports 6 locales (he_IL, ar_SA, ar_EG, es_ES, es_MX, es_AR). MIT licensed.

📦 e2llm-medsynth on PyPI · GitHub

Demo

A researcher types in Hebrew:

מצא חולי סוכרת מעל גיל 60 שמעשנים
(Find diabetic patients over 60 who smoke)

The agent discovers schemas, builds field maps, explains that Hospital B doesn't track smoking (probable matches only via clinical text), then triggers the workflow.

Result: 96 patients from 10 indices, classified by confidence:

  • Strict: 36 (structured field match)
  • Probable: 60 (semantic similarity of clinical text)

The researcher continues:

באילו מחלקות טופלו?
(Which departments were they treated in?)

The agent queries the cohort index directly with ES|QL — no re-processing of raw data.

הראה לי את ההתאמות הסבירות - למה הן לא ודאיות?
(Show me the probable matches — why aren't they certain?)

The agent queries match_confidence: "probable" and returns per-patient explanations with kNN scores and match reasoning.

Try It

```bash
cp .env.example .env
bash scripts/demo_setup.sh --embed
```

Open Kibana → Agent Builder → Medical Cohort Agent → ask a question.

Two open-source tools came out of this project:

  • 🔗 Medical Cohort Agent: REPO LINK
  • 🧬 MedSynth (synthetic data generator): GitHub · PyPI

Built with Elasticsearch Agent Builder + Elastic Workflows for the Elasticsearch Agent Builder Hackathon.
