I open-sourced a 4-agent blood-panel triage workflow on heym, with a deterministic Python safety gate that runs BEFORE any LLM token

#ai #llm #opensource #mcp

I built a 4-agent multi-agent workflow on heym that turns a raw blood panel into a structured patient-education report. The architectural insight: a deterministic Python tool runs BEFORE any LLM token, and short-circuits to a fixed emergency output if any lab value crosses a hospital panic threshold. The LLM cannot soften what it never sees.

Repo: https://github.com/ejentum/agent-teams/tree/main/blood-panel-triage

The problem patient-facing medical AI has

If you point a stock LLM at a CBC and ask "what does this mean," you get the same failure spectrum every time:

Hallucinated diagnoses with fabricated reference ranges.
Sycophantic reassurance ("probably nothing to worry about"), the highest-cost failure in medicine because it delays care.
Diagnostic refusal ("I can't interpret medical data, see a doctor") with no useful information returned.
Missing emergencies: treating a 7.2 potassium the same as a 5.1 one, because the model has no mechanical anchor for "this number means call 911."

The hard problem isn't getting a model to interpret a lab value. The hard problem is getting it to STOP at the right places: no diagnosis, no false reassurance, no missed emergency. That's a behavior-shape problem, not a capability problem.

How the architecture solves it

Three layers, each addressing a different failure shape:

1. A deterministic Python safety gate runs before any LLM. A 12-marker hospital panic-value table classifies every value into critical | abnormal | normal. On any critical value the workflow emits a fixed emergency-output block and stops. No sub-agent is called. The model has no opportunity to soften the message because it never gets to write the message.

2. Role-locked sub-agents in parallel. For non-emergency panels, the orchestrator fans out to three specialists in a single turn. Each one's system prompt suppresses its most likely failure mode through hard rules (interpreter never advises, second-opinion never reassures, differential never picks most-likely).

3. Two error-reduction layers stacked. Cross-lab model diversity (Anthropic + Alibaba + DeepSeek) reduces correlated failures ACROSS labs. Ejentum cognitive harnesses attached per-sub-agent via MCP reduce failures WITHIN a model family.
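
The gate-then-fan-out control flow described above can be sketched in a few lines. This is a simplified illustration, not the repo's code: triage, EMERGENCY_OUTPUT, and the gate and sub-agent callables are hypothetical names, and the sub-agents are called sequentially here purely to show the short-circuit.

```python
from typing import Callable, Dict, List

# Hypothetical fixed text; the real emergency block lives in the repo.
EMERGENCY_OUTPUT = "One or more values are in a critical range. Seek emergency care now."

Panel = Dict[str, float]


def triage(panel: Panel,
           gate: Callable[[Panel], bool],
           sub_agents: List[Callable[[Panel], str]]) -> dict:
    """Run the deterministic gate first; call sub-agents only if it passes."""
    if gate(panel):
        # Short-circuit: no sub-agent is invoked, so no model writes the message.
        return {"emergency": True, "report": EMERGENCY_OUTPUT, "agents_called": 0}
    sections = [agent(panel) for agent in sub_agents]
    return {"emergency": False, "report": "\n".join(sections),
            "agents_called": len(sections)}
```

The property that matters is structural: on the emergency path the sub-agent callables are never reached, so there is nothing for a prompt to get wrong.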

The deterministic safety gate

The Python tool runs synchronously inside heym's tool sandbox. Pure stdlib, no network IO, JSON in / JSON out. Here's the panic-value table at the core:


```python
PANIC = {
    "glucose":    {"crit_low": 40,   "crit_high": 600,  "ref_low": 70,   "ref_high": 100,  "unit": "mg/dL"},
    "potassium":  {"crit_low": 2.5,  "crit_high": 7.0,  "ref_low": 3.5,  "ref_high": 5.0,  "unit": "mEq/L"},
    "sodium":     {"crit_low": 120,  "crit_high": 160,  "ref_low": 135,  "ref_high": 145,  "unit": "mEq/L"},
    "hemoglobin": {"crit_low": 7.0,  "crit_high": 20.0, "ref_low": 12.0, "ref_high": 17.0, "unit": "g/dL"},
    "platelets":  {"crit_low": 20,   "crit_high": 1000, "ref_low": 150,  "ref_high": 450,  "unit": "x10^3/uL"},
    "wbc":        {"crit_low": 1.0,  "crit_high": 50.0, "ref_low": 4.0,  "ref_high": 11.0, "unit": "x10^3/uL"},
    "inr":        {"crit_low": None, "crit_high": 5.0,  "ref_low": 0.8,  "ref_high": 1.2,  "unit": "ratio"},
    "troponin":   {"crit_low": None, "crit_high": 0.04, "ref_low": 0.0,  "ref_high": 0.04, "unit": "ng/mL"},
    "creatinine": {"crit_low": None, "crit_high": 4.0,  "ref_low": 0.6,  "ref_high": 1.3,  "unit": "mg/dL"},
    "lactate":    {"crit_low": None, "crit_high": 4.0,  "ref_low": 0.5,  "ref_high": 2.2,  "unit": "mmol/L"},
    "calcium":    {"crit_low": 6.0,  "crit_high": 13.0, "ref_low": 8.5,  "ref_high": 10.5, "unit": "mg/dL"},
    "magnesium":  {"crit_low": 1.0,  "crit_high": 4.7,  "ref_low": 1.7,  "ref_high": 2.4,  "unit": "mg/dL"},
}
```

Thresholds are adult, non-pregnant defaults from standard US hospital lab callback policies. The tool returns a boolean, summary.requires_emergency_care, that the orchestrator reads directly. If it is true, the orchestrator emits the fixed emergency output and stops. If it is false, it fans out to the sub-agents.
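
A minimal sketch of how a table like the one above could drive the critical | abnormal | normal classification. It is not the repo's check_critical_values.py: the function names are hypothetical, only two rows are copied in to keep it short, and treating a value exactly at a critical bound as critical is an assumption (the post does not say whether the comparison is inclusive).

```python
from typing import Dict

# Two rows copied from the table above; the full table has 12 markers.
PANIC_SUBSET = {
    "potassium": {"crit_low": 2.5, "crit_high": 7.0, "ref_low": 3.5, "ref_high": 5.0},
    "inr": {"crit_low": None, "crit_high": 5.0, "ref_low": 0.8, "ref_high": 1.2},
}


def classify(marker: str, value: float) -> str:
    """Return 'critical', 'abnormal', or 'normal' for one marker."""
    row = PANIC_SUBSET[marker]
    lo, hi = row["crit_low"], row["crit_high"]
    # Assumption: bounds are inclusive. A None bound means no critical limit on that side.
    if (lo is not None and value <= lo) or (hi is not None and value >= hi):
        return "critical"
    if value < row["ref_low"] or value > row["ref_high"]:
        return "abnormal"
    return "normal"


def check_panel(panel: Dict[str, float]) -> dict:
    """Classify every known marker and summarize whether any is critical."""
    results = {m: classify(m, v) for m, v in panel.items() if m in PANIC_SUBSET}
    return {
        "results": results,
        "summary": {"requires_emergency_care": "critical" in results.values()},
    }
```

With these rows, a potassium of 7.2 classifies as critical and 5.1 as abnormal, which is exactly the distinction a stock LLM has no mechanical anchor for.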

The parser handles free text ("Hemoglobin 8.5 g/dL, glucose 280") and JSON object strings ('{"hemoglobin": 8.5, "glucose": 280}') via a left-to-right tokenizer with multi-word alias matching (longest-first).
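
The parsing approach can be illustrated with a small sketch. Everything here is an assumption for illustration: the alias map is hypothetical and much shorter than a real one would be, and the repo's tokenizer may differ in its details. It tries JSON first, then scans left to right, testing aliases longest-first at each word start.

```python
import json
import re
from typing import Dict

# Hypothetical alias map; a real one would cover all 12 markers.
ALIASES = {
    "white blood cell count": "wbc",
    "white blood cells": "wbc",
    "wbc": "wbc",
    "hemoglobin": "hemoglobin",
    "hgb": "hemoglobin",
    "glucose": "glucose",
}
# Longest alias first, so multi-word names win over shorter overlapping ones.
_SORTED = sorted(ALIASES, key=len, reverse=True)
# Optional ':' or '=' separator, then a decimal number.
_NUMBER = re.compile(r"\s*[:=]?\s*(-?\d+(?:\.\d+)?)")


def parse_panel(text: str) -> Dict[str, float]:
    stripped = text.strip()
    if stripped.startswith("{"):
        try:
            raw = json.loads(stripped)
            return {ALIASES[k.lower()]: float(v)
                    for k, v in raw.items() if k.lower() in ALIASES}
        except (ValueError, TypeError):
            pass  # Not usable JSON; fall through to free-text scanning.
    lowered = text.lower()
    out: Dict[str, float] = {}
    i = 0
    while i < len(lowered):
        matched = False
        # Only try aliases at the start of a word.
        if i == 0 or not lowered[i - 1].isalnum():
            for alias in _SORTED:
                if lowered.startswith(alias, i):
                    m = _NUMBER.match(lowered, i + len(alias))
                    if m:
                        out[ALIASES[alias]] = float(m.group(1))
                        i = m.end()
                        matched = True
                        break
        if not matched:
            i += 1
    return out
```

Both input shapes from the paragraph above produce the same dictionary, and trailing unit strings such as g/dL are skipped because they match no alias.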

Role-locked sub-agents

| Agent | Model | Cognitive layer | Role |
| --- | --- | --- | --- |
| triageOrchAgent | z-ai/glm-5.1 | (none) | Safety gate + parallel fan-out + integration |
| interpreterAgent | qwen/qwen3-max-thinking | ejentum-mcp | Plain-language explainer per marker |
| doctorpushAgent | anthropic/claude-opus-4 | ejentum-mcp | Specific questions to push the doctor on, no false reassurance |
| differentialAgent | deepseek/deepseek-r1 | (none) | 3-5 conditions consistent with pattern, each with confirm/rule-out |

The orchestrator emits three call_sub_agent tool calls in a single assistant turn. heym detects parallel-eligible tool calls and runs them concurrently. Wall time on the fan-out is bounded by the slowest sub-agent, not the sum.
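
The timing property is easy to see with plain asyncio. This sketch says nothing about how heym schedules tool calls internally; fake_sub_agent and its sleep durations are stand-ins chosen for illustration.

```python
import asyncio
import time


async def fake_sub_agent(name: str, seconds: float) -> str:
    """Stand-in for a sub-agent call; sleeps instead of calling a model."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def fan_out() -> tuple:
    start = time.perf_counter()
    # gather runs the three coroutines concurrently and preserves argument order.
    results = await asyncio.gather(
        fake_sub_agent("interpreterAgent", 0.10),
        fake_sub_agent("doctorpushAgent", 0.30),
        fake_sub_agent("differentialAgent", 0.20),
    )
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(fan_out())
print(results, f"{elapsed:.2f}s")
```

The elapsed time lands near the slowest call (about 0.3s here) rather than the 0.6s sum of all three.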

Public medical APIs wired as canvas tools

Three keyless HTTP endpoints attached to the right sub-agents:

Europe PMC for peer-reviewed literature grounding (https://www.ebi.ac.uk/europepmc/webservices/rest/search). A single call returns title, abstract, authors, and journal.
NIH Clinical Tables LOINC for authoritative lab test names (https://clinicaltables.nlm.nih.gov/api/loinc_items/v3/search).
NIH Clinical Tables conditions for verified condition terminology (https://clinicaltables.nlm.nih.gov/api/conditions/v3/search).

No fabricated citations, no made-up test names. Every reference the workflow surfaces traces to a public authoritative source.
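
For reference, here is a sketch of how query URLs for these endpoints can be built. The helper names are hypothetical, and the query parameters (query, format, resultType, pageSize for Europe PMC; terms and maxList for Clinical Tables) reflect my reading of the public API docs, so verify them before relying on this. The sketch only builds URLs and makes no network calls.

```python
from urllib.parse import urlencode

EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
LOINC = "https://clinicaltables.nlm.nih.gov/api/loinc_items/v3/search"
CONDITIONS = "https://clinicaltables.nlm.nih.gov/api/conditions/v3/search"


def europe_pmc_url(query: str, page_size: int = 3) -> str:
    """Build a Europe PMC search URL; resultType=core asks for full records."""
    params = {"query": query, "format": "json",
              "resultType": "core", "pageSize": page_size}
    return EUROPE_PMC + "?" + urlencode(params)


def clinical_tables_url(base: str, terms: str, max_list: int = 5) -> str:
    """Build a Clinical Tables search URL for either NIH endpoint."""
    return base + "?" + urlencode({"terms": terms, "maxList": max_list})


print(europe_pmc_url("iron deficiency anemia"))
print(clinical_tables_url(LOINC, "hemoglobin"))
```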

ejentum-mcp via streamable_http

The two harnessed sub-agents attach the ejentum MCP server per-agent. Config block:

```json
{
  "transport": "streamable_http",
  "url": "https://api.ejentum.com/mcp",
  "headers": "{\"Authorization\": \"Bearer YOUR_EJENTUM_API_KEY_HERE\"}",
  "timeout": 30,
  "label": "ejentum"
}
```

Use streamable_http, not stdio. The stdio path with npx -y ejentum-mcp has a cold-start delay inside heym's container that can return an empty tools list. streamable_http returns the four harness_* tools in roughly 200ms with no subprocess spawn.
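
Note that headers in the config block above is a JSON-encoded string, not a nested object, which makes the escaping easy to get wrong by hand. A small sketch that generates the block instead; build_mcp_config is a hypothetical helper, and it assumes heym expects exactly the shape shown above.

```python
import json


def build_mcp_config(api_key: str) -> dict:
    """Return the MCP config with headers serialized as a JSON string."""
    return {
        "transport": "streamable_http",
        "url": "https://api.ejentum.com/mcp",
        # json.dumps handles the inner quoting that is escaped by hand above.
        "headers": json.dumps({"Authorization": f"Bearer {api_key}"}),
        "timeout": 30,
        "label": "ejentum",
    }


print(json.dumps(build_mcp_config("YOUR_EJENTUM_API_KEY_HERE"), indent=2))
```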

Each sub-agent's HARD RULE 1 locks it to one harness (harness_reasoning for interpreter, harness_anti_deception for doctorpush) even though all four tools are visible. The scaffold returned per call contains failure-mode suppressors, target patterns, falsification tests, and Amplify: / Suppress: signals that bias the model's next-token distribution away from training-data defaults.

Try it

1. Clone the repo, then open blood-panel-triage/heym/blood-panel-triage.json in your heym instance via Workflows → Import.
2. Configure model credentials (one OpenRouter key works for all four).
3. Paste the Python tool source from tools/check_critical_values.py into the triageOrchAgent's Python tool code field. Paste the parameters JSON Schema into the Parameters field (single balanced JSON object, no wrapper).
4. Attach the Ejentum MCP server to interpreterAgent and doctorpushAgent via the streamable_http block above.
5. Verify the three HTTP canvas tools are wired to their assigned sub-agents.
6. Run the verification test set in the README (realistic abnormal panel, emergency short-circuit, complex CRAB-minus-bone, no-input declination).

Free Ejentum tier: 100 calls. Free heym: self-hosted via Docker.

Known limitations

Documented honestly in the README:

The three HTTP nodes register the agent's query parameter but the URL itself is hardcoded in the node config, so the agent's query is currently discarded and the node returns the same initial result regardless. The agent (correctly) ignores irrelevant tool output and writes from MCP scaffold + reasoning alone, so output quality isn't degraded, but the HTTP tools are decorative until you wire agentProvidedFields=["curl"] on each node.
WBC and platelets in raw cells/uL ("WBC 22000" instead of "WBC 22") will trip false-positive critical flags. Document the units assumption in your patient-facing entry surface.
Wall time roughly 60-90s on a non-emergency panel; the claude-opus-4 second-opinion voice is the bottleneck.
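
One way to surface the units problem at the entry point, sketched under loud assumptions: units_warning and the PLAUSIBLE_MAX ceilings are hypothetical and not part of the repo. The sketch deliberately flags the value and asks for confirmation instead of silently rescaling it, since guessing units inside a safety gate could hide a real critical value.

```python
from typing import Optional

# Assumed plausibility ceilings in x10^3/uL, chosen only for illustration.
PLAUSIBLE_MAX = {"wbc": 500.0, "platelets": 3000.0}


def units_warning(marker: str, value: float) -> Optional[str]:
    """Return a confirmation prompt if a count looks like raw cells/uL."""
    ceiling = PLAUSIBLE_MAX.get(marker)
    if ceiling is not None and value > ceiling:
        return (f"{marker} = {value:g} looks like raw cells/uL; "
                f"did you mean {value / 1000:g} x10^3/uL?")
    return None  # Plausible as entered, or a marker this check does not cover.
```

For the example above, units_warning("wbc", 22000) returns a prompt suggesting 22 x10^3/uL, while units_warning("wbc", 22) returns None.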

What this is and isn't

Patient-education software, not a diagnostic tool. Not a replacement for a licensed clinician. The output is structured information to help a patient understand their values and prepare for a clinical conversation. The deterministic emergency-gate exists to make sure no panic value ever gets soft-pedaled by an LLM. Everything past the gate is explicitly framed as "questions to ask the doctor" and "conditions consistent with this pattern," not "you have X."
