Alechko

Building patient cohorts from messy medical data with Elasticsearch Agent Builder

Over 80% of clinical trials fail to meet enrollment deadlines (PMC). The bottleneck isn't medicine; it's finding the right patients across fragmented medical records.

This article describes Medical Cohort Agent, an AI system built with Elasticsearch Agent Builder that turns a researcher's natural language question into a normalized, queryable patient cohort — handling schema variance, OCR artifacts, and missing data across multiple healthcare facilities.

This agent doesn't answer questions — it creates artifacts.

The Problem: Schema Variance Is the Real Enemy

We work with one of Israel's largest HMOs — an organization serving millions of patients across hundreds of facilities. Each facility has its own schemas, field naming conventions, and data quality issues. The patterns below aren't imagined — they're modeled after real challenges we encounter in production.

| Concept | Hospital A | Hospital B | Lab Chain | Clinic Network |
|---|---|---|---|---|
| Age | `patient_age: 67` | `גיל: 67` (Hebrew "age") | `age: 67` | `age_group: "60-70"` |
| Date | `2024-03-15` | `15/03/2024` | `15.03.24` | `2024-03` |
| Smoking | `smoking_status: true` | (not tracked) | (not tracked) | `smoking: true` |
| Notes | `clinical_notes` | `סיכום_רפואי` (Hebrew "medical summary") | `notes` | `text` |

Add OCR artifacts from scanned documents — סוכדת instead of סוכרת (diabetes misspelled by scanner) — and exact keyword matching fails silently.

The bottleneck isn't data volume. It's the variance.

The Solution: Two-Layer Architecture

The key insight: separate judgment from execution.

Layer 1 — Agent (Judgment)

The Elasticsearch Agent Builder agent handles schema discovery and criteria planning:

  • Discovers all indices and their mappings via platform tools (list_indices, get_index_mapping, search)
  • Builds per-facility field maps (which field holds "age", which holds "conditions")
  • Plans structured criteria from the natural language question
  • Explains data gaps and caveats BEFORE execution — e.g. "Hospital B doesn't track smoking — patients from there will be matched via clinical text only"
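The agent's hand-off to Layer 2 can be pictured as a per-facility field map. The sketch below is illustrative; the names and exact structure are assumptions, not the repo's actual output format:

```python
# Hypothetical per-facility field map the agent might emit after schema
# discovery. Keys are canonical concepts; values are the facility-specific
# field names (None where the facility doesn't track the concept).
field_maps = {
    "hospital_a": {"age": "patient_age", "smoking": "smoking_status", "notes": "clinical_notes"},
    "hospital_b": {"age": "גיל", "smoking": None, "notes": "סיכום_רפואי"},
    "lab_chain": {"age": "age", "smoking": None, "notes": "notes"},
    "clinic_network": {"age": "age_group", "smoking": "smoking", "notes": "text"},
}

# A data-gap caveat like "Hospital B doesn't track smoking" falls out directly:
gaps = [facility for facility, m in field_maps.items() if m["smoking"] is None]
print(gaps)  # facilities that need text-only (probable) matching
```

This is also why adding a facility needs no code changes: the agent only has to produce one more entry in the map.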

Layer 2 — Elastic Workflow (Execution)

A deterministic workflow normalizes data — no LLM in the loop:

  • Strict pass: iterates over facilities via foreach, applies a single parameterized Painless script that normalizes using the agent's field maps
  • Semantic kNN pass: uses E5-large embeddings (1024-dim) to find patients whose clinical text matches the research question semantically — catches OCR artifacts, synonyms, and negation
  • Count + breakdown: reports totals per facility and confidence level
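The real workflow performs the strict pass with a single parameterized Painless script; as a rough Python sketch of the same normalization logic (field names and the format list are illustrative):

```python
from datetime import datetime

# Date formats seen across facilities, mapped to one ISO 8601 canonical form.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%y", "%Y-%m"]

def normalize(doc: dict, field_map: dict) -> dict:
    """Rename facility-specific fields to canonical concepts and coerce values."""
    out = {}
    for concept, source_field in field_map.items():
        if source_field and source_field in doc:
            out[concept] = doc[source_field]
    # Canonicalize dates, whichever source format they arrive in.
    if "date" in out:
        for fmt in DATE_FORMATS:
            try:
                out["date"] = datetime.strptime(str(out["date"]), fmt).strftime("%Y-%m-%d")
                break
            except ValueError:
                continue
    # Patient IDs arrive as string, int, or float; normalize to a plain string.
    if "patient_id" in out:
        out["patient_id"] = str(out["patient_id"]).removesuffix(".0")
    return out

print(normalize({"גיל": 67, "תאריך": "15/03/2024"}, {"age": "גיל", "date": "תאריך"}))
```

Because the script is parameterized by the field map rather than hard-coded per facility, the `foreach` loop over facilities stays generic.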

The output is a persistent cohort_<name> index — not a chat response.
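The semantic pass is a standard Elasticsearch kNN search against the embedding field, where the query vector is the E5-large embedding of the researcher's question. A hedged sketch of the request body (the field name `clinical_text_embedding` is an assumption, not the repo's actual mapping):

```python
# Sketch of the kNN portion of the semantic pass.
def knn_query(query_vector: list[float], k: int = 50) -> dict:
    assert len(query_vector) == 1024, "E5-large produces 1024-dim vectors"
    return {
        "knn": {
            "field": "clinical_text_embedding",  # assumed field name
            "query_vector": query_vector,
            "k": k,
            "num_candidates": k * 4,  # wider candidate pool improves recall
        },
        "_source": ["patient_id", "clinical_text"],
    }

body = knn_query([0.0] * 1024)
print(body["knn"]["k"], body["knn"]["num_candidates"])
```

Because matching happens in embedding space rather than on surface tokens, an OCR-corrupted סוכדת still lands near סוכרת (diabetes).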

Why Not Just a Chatbot?

A chatbot answers questions. This agent creates artifacts.

The cohort index persists after the conversation ends. Researchers can:

  • Query it with ES|QL for follow-up analysis
  • Filter by match_confidence: strict vs probable
  • Inspect per-match provenance (source_field_map, match_explanation, knn_score)
  • Share it with colleagues
  • Build on it without re-processing raw data
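A follow-up department breakdown against the cohort index might look like this in ES|QL (the index and field names are assumed for illustration):

```sql
FROM cohort_diabetes_smokers_60plus
| WHERE match_confidence == "probable"
| STATS patients = COUNT(*) BY department, facility
| SORT patients DESC
```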

The workflow is deterministic and auditable. Adding a new facility requires no code changes — only the agent discovering and mapping its schema.

Normalization Impact

Reproducible metrics from the sample corpus (10 indices, 4 facilities):

| Metric | Before | After |
|---|---|---|
| Field names per concept | 5+ variants | 1 unified |
| Date formats | 4 different | 1 canonical |
| Patient ID types | string, int, float | normalized string |
| OCR-corrupted clinical terms | invisible to search | captured via kNN |
| Structured/text contradictions | ~18% undetected | classified with confidence |

The metrics script is included in the repo: `python3 scripts/metrics.py --data-dir sample_data`

Air-Gapped Deployment

Healthcare data cannot leave the network. The entire stack runs on a single VM with no internet dependency:

  • LLM: Ollama + Llama 4 (local inference)
  • Embeddings: E5-large via Ollama (1024-dim vectors)
  • Stack: Elasticsearch 9.3 + Kibana (Agent Builder + Workflows)

The architecture is LLM-agnostic — swap Ollama for any OpenAI-compatible provider without code changes.

Synthetic Data: e2llm-medsynth

Real patient data can't be used for demos. We built the data generator as a standalone open-source tool:

```bash
pip install e2llm-medsynth
e2llm-medsynth --verbose --output-dir output
```

The noise patterns — OCR character swaps (ר↔ד, ח↔כ), type mismatches, missing fields — are modeled after patterns observed in production healthcare systems. MedSynth makes them reproducible for anyone.
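As a minimal sketch, the Hebrew OCR character-swap noise can be reproduced like this (a simplified illustration, not MedSynth's actual implementation):

```python
import random

# Visually similar Hebrew letter pairs that scanners commonly confuse.
OCR_SWAPS = {"ר": "ד", "ד": "ר", "ח": "כ", "כ": "ח"}

def add_ocr_noise(text: str, rate: float, seed: int = 0) -> str:
    """Swap confusable letters with the given probability, seeded for reproducibility."""
    rng = random.Random(seed)
    return "".join(
        OCR_SWAPS[ch] if ch in OCR_SWAPS and rng.random() < rate else ch
        for ch in text
    )

print(add_ocr_noise("סוכרת", rate=1.0))  # every swappable letter flips
```

Seeding the RNG is what makes the corruption reproducible across runs, so the same corpus can be regenerated for benchmarking.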

Supports 6 locales (he_IL, ar_SA, ar_EG, es_ES, es_MX, es_AR). MIT licensed.

📦 e2llm-medsynth on PyPI · GitHub

Demo

A researcher types in Hebrew:

מצא חולי סוכרת מעל גיל 60 שמעשנים
(Find diabetic patients over 60 who smoke)

The agent discovers schemas, builds field maps, explains that Hospital B doesn't track smoking (probable matches only via clinical text), then triggers the workflow.

Result: 96 patients from 10 indices, classified by confidence:

  • Strict: 36 (structured field match)
  • Probable: 60 (semantic similarity of clinical text)

The researcher continues:

באילו מחלקות טופלו?
(Which departments were they treated in?)

The agent queries the cohort index directly with ES|QL — no re-processing of raw data.

הראה לי את ההתאמות הסבירות - למה הן לא ודאיות?
(Show me the probable matches — why aren't they certain?)

The agent queries match_confidence: "probable" and returns per-patient explanations with kNN scores and match reasoning.

Try It

```bash
cp .env.example .env
bash scripts/demo_setup.sh --embed
```

Open Kibana → Agent Builder → Medical Cohort Agent → ask a question.

Two open-source tools came out of this project:

  • 🔗 Medical Cohort Agent: REPO LINK
  • 🧬 MedSynth (synthetic data generator): GitHub · PyPI

Built with Elasticsearch Agent Builder + Elastic Workflows for the Elasticsearch Agent Builder Hackathon.
