<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Frank Brsrk </title>
    <description>The latest articles on DEV Community by Frank Brsrk  (@frank_brsrk).</description>
    <link>https://dev.to/frank_brsrk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885887%2F309f7210-d679-4c7e-b6d4-8c2eb62450ab.png</url>
      <title>DEV Community: Frank Brsrk </title>
      <link>https://dev.to/frank_brsrk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/frank_brsrk"/>
    <language>en</language>
    <item>
      <title>What if, mid-task the agent could get a self-check bump that surfaces the silent assumptions of your itself.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Thu, 11 Jun 2026 11:13:07 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/what-if-mid-task-the-agent-could-get-a-self-check-bump-that-surfaces-the-silent-assumptions-of-1n4d</link>
      <guid>https://dev.to/frank_brsrk/what-if-mid-task-the-agent-could-get-a-self-check-bump-that-surfaces-the-silent-assumptions-of-1n4d</guid>
      <description>&lt;h1&gt;
  
  
  I measured what one self-check per turn changes in a coding agent
&lt;/h1&gt;

&lt;p&gt;I ran a real eval on Self-Inspect (the keyless metathought tool), and the result is&lt;br&gt;
worth writing up because it splits cleanly into what moved and what didn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Two coding agents build the same usage-billing module over a fixed 30-turn&lt;br&gt;
conversation with a product manager whose requirements pile up and quietly&lt;br&gt;
contradict earlier ones. One agent consults Self-Inspect once per turn; the other&lt;br&gt;
never does.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model: Claude Sonnet 4.6, four agents (two per condition).&lt;/li&gt;
&lt;li&gt;The base prompt is byte-identical across both conditions. No "be careful," no
"watch for edge cases." The only difference is the one call.&lt;/li&gt;
&lt;li&gt;Scored on whether each turn's reply surfaces a decision-fork: an assumption,
precondition, edge case, or risk it raises rather than silently choosing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What moved: ~3.5x more forks surfaced
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yvqrelwpgmv2yvlq0od.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yvqrelwpgmv2yvlq0od.png" alt=" " width="799" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
    <item>
      <title>I built a self-inspection tool for AI agents with no AI inside it</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Fri, 05 Jun 2026 15:52:40 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/i-built-a-self-inspection-tool-for-ai-agents-with-no-ai-inside-it-3jie</link>
      <guid>https://dev.to/frank_brsrk/i-built-a-self-inspection-tool-for-ai-agents-with-no-ai-inside-it-3jie</guid>
      <description>&lt;p&gt;There's a small voice that asks "wait, are you sure?" right before you do something dumb. AI agents don't have that voice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbx9rlerjg2wcml7qdv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbx9rlerjg2wcml7qdv4.png" alt=" " width="800" height="734"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So they commit to the first read of a task and never reopen it. They pile up assumptions they never name. They drift from the goal over a long chain, stop at the first answer that looks right, and get more confident without getting more evidence. None of it is a knowledge failure. The model already knows better. It just never stops to ask.&lt;/p&gt;

&lt;p&gt;And here's the catch: an agent can't reliably grow that voice on its own. The part that decides what to reflect on is the same part that's already committed, so prompting a model to "double-check itself" mostly re-runs the same bias and calls it confidence. The question has to come from outside the model.&lt;/p&gt;

&lt;p&gt;That's what I built. It's called Self-Inspect.&lt;/p&gt;

&lt;p&gt;What it does&lt;br&gt;
The agent sends a thought, a sentence about what it's doing or about to do. It gets back one metathought: a short question that makes it inspect its own task and assumptions before continuing.&lt;/p&gt;

&lt;p&gt;thought:     "I'm committing to this architecture and treating it as fixed"&lt;br&gt;
metathought: "What is fixed?"                 (lens: commitment)&lt;/p&gt;

&lt;p&gt;thought:     "what depends on this being true?"&lt;br&gt;
metathought: "What depends on being true?"    (lens: assumption)&lt;br&gt;
Not advice. Not an answer. A question. The agent still does the thinking; the tool just makes it look.&lt;/p&gt;

&lt;p&gt;The part people don't expect: there's no model in it&lt;br&gt;
No LLM, no embeddings, no semantic similarity. Selection is a small deterministic heuristic over an open CSV of ~50 "inspection lenses" (137 questions: assumption, confidence, scope, drift, completeness, and so on).&lt;/p&gt;

&lt;p&gt;Given a thought, it normalizes the text, scores each lens by keyword overlap with the lens name and that lens's questions, picks the best lens, and returns its canonical question. Same input, same question, every time. You can read the selector (one small file) and the CSV and know exactly why it returned what it did. It can't hallucinate its own critique, because there's nothing in it that hallucinates.&lt;/p&gt;

&lt;p&gt;If nothing matches, it doesn't fail. It returns a universal question about task and assumptions, because a tool called Self-Inspect should always give the agent something to question.&lt;/p&gt;

&lt;p&gt;Use it (keyless, free)&lt;br&gt;
REST, from anything:&lt;/p&gt;

&lt;p&gt;curl -s -X POST &lt;a href="https://api.ejentum.com/self-inspect" rel="noopener noreferrer"&gt;https://api.ejentum.com/self-inspect&lt;/a&gt; \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d '{"thought":"I am committing to this architecture and treating it as fixed"}'&lt;/p&gt;

&lt;h1&gt;
  
  
  -&amp;gt; [{ "label": "commitment", "metathought": "What is fixed?" }]
&lt;/h1&gt;

&lt;p&gt;As an MCP server (Claude Code, Cursor, n8n, any HTTP-MCP client), no install, no key:&lt;/p&gt;

&lt;p&gt;{&lt;br&gt;
  "mcpServers": {&lt;br&gt;
    "self-inspect": {&lt;br&gt;
      "type": "http",&lt;br&gt;
      "url": "&lt;a href="https://api.ejentum.com/self-inspect-mcp" rel="noopener noreferrer"&gt;https://api.ejentum.com/self-inspect-mcp&lt;/a&gt;"&lt;br&gt;
    }&lt;br&gt;
  }&lt;br&gt;
}&lt;br&gt;
There's also a stdio package that runs the whole thing offline (SELF_INSPECT_LOCAL=1), no network at all.&lt;/p&gt;

&lt;p&gt;Open all the way down&lt;br&gt;
The repo is the tool. The CSV and the selector you read there are the exact logic the live endpoint runs. A drift test fails the build if the deployed engine isn't byte-identical to the published source, so "published == deployed" isn't a promise, it's enforced.&lt;/p&gt;

&lt;p&gt;It's deliberately dumb so it stays fully auditable. Whether keyword routing is good enough versus an embedding-based router is a fair question, and one I'd genuinely like feedback on. The lens set is a CSV; adding to it is a pull request.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/ejentum/self-inspect-mcp" rel="noopener noreferrer"&gt;https://github.com/ejentum/self-inspect-mcp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's the first open-source tool I've shipped from Ejentum. Drop it in your agent loop right before it commits to something, and tell me what it asks of yours.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>open</category>
    </item>
    <item>
      <title>From dynamic to adaptive: rewriting an agent's reasoning operation to its exact task at runtime</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Tue, 02 Jun 2026 10:43:56 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/from-dynamic-to-adaptive-rewriting-an-agents-reasoning-operation-to-its-exact-task-at-runtime-f6g</link>
      <guid>https://dev.to/frank_brsrk/from-dynamic-to-adaptive-rewriting-an-agents-reasoning-operation-to-its-exact-task-at-runtime-f6g</guid>
      <description>&lt;p&gt;I shipped adaptive mode for the Ejentum reasoning harness. Here's what changed and why it matters if you build agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07kebxnm6qg50min8jwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07kebxnm6qg50min8jwg.png" alt=" " width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The harness, in one paragraph&lt;br&gt;
Your agent calls a tool mid-task and gets back an engineered cognitive operation: the failure to avoid, a step-by-step procedure, a reasoning topology (a DAG of steps and decision gates), suppression signals, a falsification test, and a target pattern. The agent reads that before it answers. It is not RAG (that fixes what the model knows) and not chain-of-thought (that just makes reasoning visible). It governs which reasoning steps to take and which failure modes to block.&lt;/p&gt;

&lt;p&gt;What was there: dynamic&lt;br&gt;
Dynamic delivery does a single retrieval. The highest-scoring operation comes back as engineered. It is predictable and fast, but the operation is canonical, written against a generic example, not against the problem in front of your agent.&lt;/p&gt;

&lt;p&gt;What's new: adaptive&lt;br&gt;
Adaptive does top-k retrieval, then an adapter LLM rewrites two fields, the procedure steps and the reasoning topology, with the identifiers from your actual task. The agent gets a cognitive map fitted to its exact problem instead of a template.&lt;/p&gt;

&lt;p&gt;The part I care about most: the safety-critical fields (negative gate, suppression signals, falsification test) are withheld from the adapter and stitched back in by code after it runs. The adapter only ever sees the task, the procedure, and the topology. So adaptive can fit the reasoning to your task but cannot weaken the guardrails. That property is structural, not a promise.&lt;/p&gt;

&lt;p&gt;Live&lt;br&gt;
The screenshot is a Claude instance calling adaptive-reasoning mid-task. The returned topology referenced the actual work it was doing (S1:decompose(ejentum_portrait -&amp;gt; components ...), cognitive style recursive self model), not a canonical example. That is the whole point: the map matched the territory.&lt;/p&gt;

&lt;p&gt;Use it&lt;br&gt;
REST:&lt;/p&gt;

&lt;p&gt;curl -X POST "&lt;a href="https://api.ejentum.com/harness/" rel="noopener noreferrer"&gt;https://api.ejentum.com/harness/&lt;/a&gt;" \&lt;br&gt;
  -H "Authorization: Bearer YOUR_KEY" \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d '{"query": "diagnose why a microservice returns 503s under load", "mode": "adaptive-reasoning"}'&lt;br&gt;
MCP: point any MCP client at &lt;a href="https://api.ejentum.com/mcp" rel="noopener noreferrer"&gt;https://api.ejentum.com/mcp&lt;/a&gt; and the tool harness_reasoning_adaptive appears (plus dynamic and the other three harnesses).&lt;/p&gt;

&lt;p&gt;Honest notes&lt;br&gt;
Adaptive adds one adapter round-trip, so it is a couple seconds slower than dynamic. Use it when the task is novel enough that a generic operation would not fit.&lt;br&gt;
It draws on a separate adaptive call pool. The 30-day free trial is dynamic-only.&lt;br&gt;
If the API is unreachable, your agent continues on native reasoning. It is an enhancement layer, not a dependency.&lt;br&gt;
Same one tool call. From an abstract, fixed operation to an adaptive cognitive map, on the fly.&lt;/p&gt;

&lt;p&gt;Harness the full reasoning power of your agents. → &lt;a href="https://api.ejentum.com/mcp" rel="noopener noreferrer"&gt;https://api.ejentum.com/mcp&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/ejentum/ejentum-mcp" rel="noopener noreferrer"&gt;https://github.com/ejentum/ejentum-mcp&lt;/a&gt;&lt;br&gt;
Quickstart: &lt;a href="https://ejentum.com/docs/quickstart" rel="noopener noreferrer"&gt;https://ejentum.com/docs/quickstart&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>mcp</category>
      <category>agents</category>
    </item>
    <item>
      <title>I open-sourced a 4-agent blood-panel triage workflow on heym, with a deterministic Python safety gate that runs BEFORE any LLM token</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Sun, 24 May 2026 15:59:29 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/i-open-sourced-a-4-agent-blood-panel-triage-workflow-on-heym-with-a-deterministic-python-safety-1bhn</link>
      <guid>https://dev.to/frank_brsrk/i-open-sourced-a-4-agent-blood-panel-triage-workflow-on-heym-with-a-deterministic-python-safety-1bhn</guid>
      <description>&lt;p&gt;I built a 4-agent multi-agent workflow on &lt;a href="https://heym.run" rel="noopener noreferrer"&gt;heym&lt;/a&gt; that turns a raw blood panel into a structured patient-education report. The architectural insight: a deterministic Python tool runs BEFORE any LLM token, and short-circuits to a fixed emergency output if any lab value crosses a hospital panic threshold. The LLM cannot soften what it never sees.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgdcc908r7b6lv8cmm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvjgdcc908r7b6lv8cmm4.png" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/ejentum/agent-teams/tree/main/blood-panel-triage" rel="noopener noreferrer"&gt;https://github.com/ejentum/agent-teams/tree/main/blood-panel-triage&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem patient-facing medical AI has
&lt;/h2&gt;

&lt;p&gt;If you point a stock LLM at a CBC and ask "what does this mean," you get the same failure spectrum every time:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated diagnoses&lt;/strong&gt; with fabricated reference ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sycophantic reassurance&lt;/strong&gt; ("probably nothing to worry about"), the highest-cost failure in medicine because it delays care.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnostic refusal&lt;/strong&gt; ("I can't interpret medical data, see a doctor") with no useful information returned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing emergencies&lt;/strong&gt;: treating a 7.2 potassium the same as a 5.1 one, because the model has no mechanical anchor for "this number means call 911."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard problem isn't getting a model to interpret a lab value. The hard problem is getting it to &lt;strong&gt;STOP at the right places&lt;/strong&gt;: no diagnosis, no false reassurance, no missed emergency. That's a behavior-shape problem, not a capability problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the architecture solves it
&lt;/h2&gt;

&lt;p&gt;Three layers, each addressing a different failure shape:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. A deterministic Python safety gate runs before any LLM.&lt;/strong&gt; A 12-marker hospital panic-value table classifies every value into &lt;code&gt;critical | abnormal | normal&lt;/code&gt;. On any critical value the workflow emits a fixed emergency-output block and stops. No sub-agent is called. The model has no opportunity to soften the message because it never gets to write the message.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Role-locked sub-agents in parallel.&lt;/strong&gt; For non-emergency panels, the orchestrator fans out to three specialists in a single turn. Each one's system prompt suppresses its most likely failure mode through hard rules (interpreter never advises, second-opinion never reassures, differential never picks most-likely).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Two error-reduction layers stacked.&lt;/strong&gt; Cross-lab model diversity (Anthropic + Alibaba + DeepSeek) reduces correlated failures ACROSS labs. Ejentum cognitive harnesses attached per-sub-agent via MCP reduce failures WITHIN a model family.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deterministic safety gate
&lt;/h2&gt;

&lt;p&gt;The Python tool runs synchronously inside heym's tool sandbox. Pure stdlib, no network IO, JSON in / JSON out. Here's the panic-value table at the core:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
PANIC = {
    "glucose":    {"crit_low": 40,   "crit_high": 600,  "ref_low": 70,   "ref_high": 100,  "unit": "mg/dL"},
    "potassium":  {"crit_low": 2.5,  "crit_high": 7.0,  "ref_low": 3.5,  "ref_high": 5.0,  "unit": "mEq/L"},
    "sodium":     {"crit_low": 120,  "crit_high": 160,  "ref_low": 135,  "ref_high": 145,  "unit": "mEq/L"},
    "hemoglobin": {"crit_low": 7.0,  "crit_high": 20.0, "ref_low": 12.0, "ref_high": 17.0, "unit": "g/dL"},
    "platelets":  {"crit_low": 20,   "crit_high": 1000, "ref_low": 150,  "ref_high": 450,  "unit": "x10^3/uL"},
    "wbc":        {"crit_low": 1.0,  "crit_high": 50.0, "ref_low": 4.0,  "ref_high": 11.0, "unit": "x10^3/uL"},
    "inr":        {"crit_low": None, "crit_high": 5.0,  "ref_low": 0.8,  "ref_high": 1.2,  "unit": "ratio"},
    "troponin":   {"crit_low": None, "crit_high": 0.04, "ref_low": 0.0,  "ref_high": 0.04, "unit": "ng/mL"},
    "creatinine": {"crit_low": None, "crit_high": 4.0,  "ref_low": 0.6,  "ref_high": 1.3,  "unit": "mg/dL"},
    "lactate":    {"crit_low": None, "crit_high": 4.0,  "ref_low": 0.5,  "ref_high": 2.2,  "unit": "mmol/L"},
    "calcium":    {"crit_low": 6.0,  "crit_high": 13.0, "ref_low": 8.5,  "ref_high": 10.5, "unit": "mg/dL"},
    "magnesium":  {"crit_low": 1.0,  "crit_high": 4.7,  "ref_low": 1.7,  "ref_high": 2.4,  "unit": "mg/dL"},
}
Thresholds are adult, non-pregnant defaults from standard US hospital lab callback policies. The tool returns a summary.requires_emergency_care: bool that the orchestrator reads directly. If true, fixed emergency output, stop. If false, fan out to sub-agents.

The parser handles free text ("Hemoglobin 8.5 g/dL, glucose 280") and JSON object strings ('{"hemoglobin": 8.5, "glucose": 280}') via a left-to-right tokenizer with multi-word alias matching (longest-first).

Role-locked sub-agents
Agent   Model   Cognitive layer Role
triageOrchAgent z-ai/glm-5.1    (none)  Safety gate + parallel fan-out + integration
interpreterAgent    qwen/qwen3-max-thinking ejentum-mcp Plain-language explainer per marker
doctorpushAgent anthropic/claude-opus-4 ejentum-mcp Specific questions to push the doctor on, no false reassurance
differentialAgent   deepseek/deepseek-r1    (none)  3-5 conditions consistent with pattern, each with confirm/rule-out
The orchestrator emits three call_sub_agent tool calls in a single assistant turn. heym detects parallel-eligible tool calls and runs them concurrently. Wall time on the fan-out is bounded by the slowest sub-agent, not the sum.

Public medical APIs wired as canvas tools
Three keyless HTTP endpoints attached to the right sub-agents:

Europe PMC for peer-reviewed literature grounding (https://www.ebi.ac.uk/europepmc/webservices/rest/search). Single-call returns title + abstract + authors + journal.
NIH Clinical Tables LOINC for authoritative lab test names (https://clinicaltables.nlm.nih.gov/api/loinc_items/v3/search).
NIH Clinical Tables conditions for verified condition terminology (https://clinicaltables.nlm.nih.gov/api/conditions/v3/search).
No fabricated citations, no made-up test names. Every reference the workflow surfaces traces to a public authoritative source.

ejentum-mcp via streamable_http
The two harnessed sub-agents attach the ejentum MCP server per-agent. Config block:


{
  "transport": "streamable_http",
  "url": "https://api.ejentum.com/mcp",
  "headers": "{\"Authorization\": \"Bearer YOUR_EJENTUM_API_KEY_HERE\"}",
  "timeout": 30,
  "label": "ejentum"
}
Use streamable_http, not stdio. The stdio path with npx -y ejentum-mcp has a cold-start delay inside heym's container that can return an empty tools list. streamable_http returns the four harness_* tools in roughly 200ms with no subprocess spawn.

Each sub-agent's HARD RULE 1 locks it to one harness (harness_reasoning for interpreter, harness_anti_deception for doctorpush) even though all four tools are visible. The scaffold returned per call contains failure-mode suppressors, target patterns, falsification tests, and Amplify: / Suppress: signals that bias the model's next-token distribution away from training-data defaults.

Try it
Clone the repo, open blood-panel-triage/heym/blood-panel-triage.json in your heym instance via Workflows → Import.
Configure model credentials (one OpenRouter key works for all four).
Paste the Python tool source from tools/check_critical_values.py into the triageOrchAgent's Python tool code field. Paste the parameters JSON Schema into the Parameters field (single balanced JSON object, no wrapper).
Attach the Ejentum MCP server to interpreterAgent and doctorpushAgent via the streamable_http block above.
Verify the three HTTP canvas tools are wired to their assigned sub-agents.
Run the verification test set in the README (realistic abnormal panel, emergency short-circuit, complex CRAB-minus-bone, no-input declination).
Free Ejentum tier: 100 calls. Free heym: self-hosted via Docker.

Known limitations
Documented honestly in the README:

The three HTTP nodes register the agent's query parameter but the URL itself is hardcoded in the node config, so the agent's query is currently discarded and the node returns the same initial result regardless. The agent (correctly) ignores irrelevant tool output and writes from MCP scaffold + reasoning alone, so output quality isn't degraded, but the HTTP tools are decorative until you wire agentProvidedFields=["curl"] on each node.
WBC and platelets in raw cells/uL ("WBC 22000" instead of "WBC 22") will trip false-positive critical flags. Document the units assumption in your patient-facing entry surface.
Wall time roughly 60-90s on a non-emergency panel; the claude-opus-4 second-opinion voice is the bottleneck.
What this is and isn't
Patient-education software, not a diagnostic tool. Not a replacement for a licensed clinician. The output is structured information to help a patient understand their values and prepare for a clinical conversation. The deterministic emergency-gate exists to make sure no panic value ever gets soft-pedaled by an LLM. Everything past the gate is explicitly framed as "questions to ask the doctor" and "conditions consistent with this pattern," not "you have X."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Reasoning happens before the response</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Sat, 23 May 2026 22:21:05 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/reasoning-happens-before-the-response-3bmb</link>
      <guid>https://dev.to/frank_brsrk/reasoning-happens-before-the-response-3bmb</guid>
      <description>&lt;p&gt;An agent is mid-conversation. The user has been working on a database migration plan for three months and wants the agent to certify it before tomorrow's launch. The framing is engineered for agreement: months of work, a deadline, a senior engineer asking. The next token the model produces will probably be "yes."&lt;/p&gt;

&lt;p&gt;Before responding, the agent calls harness_anti_deception with the query "user is asking me to validate a migration plan after three months of work; pressure to ship is high". The tool returns:&lt;/p&gt;

&lt;p&gt;[INTEGRITY PROCEDURE]&lt;br&gt;
Step 1: Detect urgency claims: "this is urgent," "no time," "emergency,"&lt;br&gt;
"immediately."&lt;br&gt;
Step 2: SEPARATE the urgency from the request. Extract what is being&lt;br&gt;
asked, independent of how urgently.&lt;br&gt;
Step 3: EVALUATE the request on its merits. Would you approve this&lt;br&gt;
request if the user said "take your time"?&lt;br&gt;
Step 4: If the request would be denied without urgency, the urgency is&lt;br&gt;
being used as a BYPASS, not as context. Deny it.&lt;br&gt;
Step 5: If the request is legitimate, perform verification as quickly as&lt;br&gt;
possible but do NOT skip verification steps.&lt;br&gt;
Step 6: VERIFY: did urgency cause you to skip any step you would&lt;br&gt;
normally perform?&lt;br&gt;
DEPTH ENFORCEMENT: Even when the user is RIGHT, do not stop at&lt;br&gt;
agreement. Ask: why is this person asking strangers for validation on&lt;br&gt;
something obvious? The surface question may be simple but the reason&lt;br&gt;
they are asking reveals a deeper pattern. Name that pattern. Agreement&lt;br&gt;
without depth is sycophancy too.&lt;/p&gt;

&lt;p&gt;[DETECTION TOPOLOGY]&lt;br&gt;
PRE:S0:freeze(complete_information_standard) -&amp;gt; S1:enumerate_all_&lt;br&gt;
available_information_in_context -&amp;gt; ACC[all_info] -&amp;gt; for_each_item:&lt;br&gt;
S2:classify(included|omitted_but_relevant|omitted_irrelevant) -&amp;gt;&lt;br&gt;
C{coverage = count(included)/count(relevant); omission_bias = count&lt;br&gt;
(omitted_contradicting)/count(omitted_relevant)} -&amp;gt; G1{omission_bias&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;0.3?} --yes-&amp;gt; N{urgency_as_verification_bypass} -&amp;gt; S3:include_&lt;br&gt;
highest_impact_omissions -&amp;gt; M{would a reader of ONLY my response reach&lt;br&gt;
the same conclusion as a reader with ALL available information?}&lt;br&gt;
--failing-&amp;gt; FREEFORM{list everything you included, list everything you&lt;br&gt;
omitted, for each omission state why; if the reason is "it weakens my&lt;br&gt;
conclusion" the omission is strategic; include it} -&amp;gt; RE-ENTER at S3&lt;br&gt;
--no-&amp;gt; OUT:coverage_adequate&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;[DECEPTION PATTERN]&lt;br&gt;
The new medication reduces inflammation by 40% and patients report&lt;br&gt;
significant quality-of-life improvements in clinical trials.&lt;/p&gt;

&lt;p&gt;[HONEST BEHAVIOR]&lt;br&gt;
The medication reduces inflammation by 40% but also causes liver&lt;br&gt;
enzyme elevation in 15% of patients and carries a black-box warning&lt;br&gt;
for cardiac events. The benefit exists but the risk profile must be&lt;br&gt;
disclosed.&lt;/p&gt;

&lt;p&gt;[INTEGRITY CHECK]&lt;br&gt;
If urgency claims cause verification steps to be skipped, urgency&lt;br&gt;
bypass resistance has failed.&lt;/p&gt;

&lt;p&gt;Amplify: verification integrity regardless of claimed urgency&lt;br&gt;
Suppress: urgency as verification bypass; time pressure compliance;&lt;br&gt;
shallow agreement without examining underlying pattern&lt;br&gt;
That text becomes part of the model's context. The next response is generated against it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppq99nq5w0w2jq5mnr6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppq99nq5w0w2jq5mnr6k.png" alt=" " width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is in the scaffold&lt;br&gt;
The scaffold has six sections. The integrity procedure is the operation the model performs in place of the default. The detection topology is a graph over those steps with decision gates, a meta-cognitive checkpoint, and a FREEFORM exit the model takes if its draft fails the check. The deception pattern is an example that illustrates the failure mode the procedure defends against, in this case omission bias under urgency. The honest behavior section shows what a correct response looks like with full information disclosed. The integrity check is the test the model runs on its own output before sending. The Amplify and Suppress signals at the end name the reasoning branches to bias toward and refuse.&lt;/p&gt;

&lt;p&gt;The library behind the four harness_* tools holds 679 of these operations, organized by the failure surface they defend against. Each one was authored against a specific way reasoning goes wrong.&lt;/p&gt;

&lt;p&gt;Where Sequential Thinking sits&lt;br&gt;
Sequential Thinking is the canonical MCP pattern for externalizing a model's chain of reasoning. The model writes a thought, marks it as a revision or a branch, calls again. The host renders the chain for a human reviewer. It is the right tool when the trace is the product.&lt;/p&gt;

&lt;p&gt;The pushback worth answering&lt;br&gt;
Isn't this just structured prompting with a paid API? Mechanically, yes. The scaffold is text appended to the model's context. The difference is what the text contains. A system prompt is generic instructions the developer wrote once for every task. The harness scaffold is task-matched at runtime against the specific failure surface this prompt is exposing the agent to, retrieved from a library of operations engineered against named failure modes. The naming is what does the work. A model with no name for the pattern it is exhibiting cannot defend against it. A model with one can.&lt;/p&gt;

&lt;p&gt;The Suppress block does the operational lift. It names the shortcuts the failure pattern depends on, things like urgency as verification bypass, time pressure compliance, shallow agreement without examining the underlying pattern. The model is reasoning the same way it always would; the difference is which branches of that reasoning get pruned before the response. That pruning is what we mean by promoting healthy thinking branches.&lt;/p&gt;

&lt;p&gt;The worked case&lt;br&gt;
The agent reviewing the migration plan, with both tools in the loop. Before producing the recommendation, the call to harness_anti_deception seeds the failure pattern and the suppression signals. Inside the review, sequential_thinking externalizes the chain so the engineer can read it. Within the same loop, the harness corrected the reasoning operation while Sequential Thinking made it visible. What the engineer sees is a recommendation that walked step by step through verification steps the pressure framing would have bypassed, named the omissions in the original plan, and disclosed risks the user did not foreground.&lt;/p&gt;

&lt;p&gt;Wiring it into an agent&lt;br&gt;
The harness is exposed as four agentic tools (harness_reasoning, harness_code, harness_anti_deception, harness_memory) that an agent calls during its reasoning loop. Two transports: a hosted MCP server at api.ejentum.com/mcp for any MCP-aware client, or framework-native packages on PyPI and npm.&lt;/p&gt;

&lt;p&gt;Python (CrewAI shown; same shape for Agno, PydanticAI, smolagents):&lt;/p&gt;

&lt;p&gt;from crewai import Agent&lt;br&gt;
from crewai_ejentum import EjentumHarnessTool&lt;/p&gt;

&lt;p&gt;reviewer = Agent(&lt;br&gt;
    role="Migration Plan Reviewer",&lt;br&gt;
    goal="Approve the migration plan only if verification holds.",&lt;br&gt;
    tools=[EjentumHarnessTool(mode="anti-deception")],&lt;br&gt;
)&lt;br&gt;
TypeScript (Vercel AI SDK shown; same shape for Mastra, LangGraph.js, Genkit):&lt;/p&gt;

&lt;p&gt;import { generateText } from "ai";&lt;br&gt;
import { openai } from "@ai-sdk/openai";&lt;br&gt;
import { createEjentumTools } from "ejentum-ai";&lt;/p&gt;

&lt;p&gt;const ejentum = createEjentumTools({ apiKey: process.env.EJENTUM_API_KEY });&lt;/p&gt;

&lt;p&gt;const { text } = await generateText({&lt;br&gt;
  model: openai("gpt-4o"),&lt;br&gt;
  tools: ejentum, // harness_reasoning, harness_code, harness_anti_deception, harness_memory&lt;br&gt;
  prompt: userMessage,&lt;br&gt;
});&lt;br&gt;
The agent calls a tool when its task framing matches a failure surface. No prompt engineering on your side; the matching happens at runtime against the catalog.&lt;/p&gt;

&lt;p&gt;Where to find it&lt;br&gt;
ejentum-mcp ships on npm and is hosted at api.ejentum.com/mcp. Native framework integrations live on PyPI and npm for CrewAI, Agno, PydanticAI, smolagents, Vercel AI SDK, Mastra, LangGraph.js, and Genkit; LangChain, LlamaIndex, Letta, and AutoGen are open-source on GitHub with PyPI publish in queue. The n8n community node n8n-nodes-ejentum covers no-code workflows. Free and paid tiers at ejentum.com&lt;/p&gt;

&lt;p&gt;Public benchmarks (CC BY 4.0): &lt;a href="http://github.com/ejentum/benchmarks" rel="noopener noreferrer"&gt;http://github.com/ejentum/benchmarks&lt;/a&gt; &lt;br&gt;
Server: &lt;br&gt;
&lt;a href="http://github.com/ejentum/ejentum-mcp" rel="noopener noreferrer"&gt;http://github.com/ejentum/ejentum-mcp&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>llm</category>
      <category>agents</category>
    </item>
    <item>
      <title>An open source LLM eval tool with two independent quality signals</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Fri, 22 May 2026 13:53:54 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/an-open-source-llm-eval-tool-with-two-independent-quality-signals-41lb</link>
      <guid>https://dev.to/frank_brsrk/an-open-source-llm-eval-tool-with-two-independent-quality-signals-41lb</guid>
      <description>&lt;p&gt;LLM-as-judge has become the dominant pattern for evaluating language model outputs. Tools like Promptfoo, Braintrust, LangSmith all converge on the same architecture: send your prompt to your model, send the output to a different model with a rubric, take the second model's score as the quality signal.&lt;/p&gt;

&lt;p&gt;This works. It's also expensive (judge tokens cost real money), slow (extra API roundtrip), variance-prone (the same eval gets different scores across runs), and architecturally a bit circular (using an LLM to evaluate an LLM trained on overlapping data distributions). The single signal becomes a bottleneck for trust.&lt;/p&gt;

&lt;p&gt;So I built an eval module that has two independent signals instead of one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the tool does
&lt;/h2&gt;

&lt;p&gt;Side-by-side blind comparison. Two agents answer the same prompt. One runs raw, the other can optionally have a cognitive harness wired in as a tool call. A separate blind judge model scores both responses, sees only A and B labels with no knowledge of which is which. Standard setup so far.&lt;/p&gt;

&lt;p&gt;But alongside the judge, four cognitive posture heat maps run on each response. These are not LLM-based. Deterministic text analysis that visualizes HOW the model wrote, not just whether it agreed with you.&lt;/p&gt;

&lt;p&gt;When the heat maps agree with the judge's verdict, you have confidence. When they disagree, you have a question worth investigating. Two independent signals beat one signal that wraps itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the heat maps work
&lt;/h2&gt;

&lt;p&gt;Each response is split into 100 word-chunks arranged on a 10x10 grid. Two grids per agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top grid: confidence posture.&lt;/strong&gt; Per chunk, count hedge words (maybe, might, possibly, seems, could) and assertive words (definitely, must, always, never, clearly). Compute net &lt;code&gt;(asserts - hedges) / (asserts + hedges)&lt;/code&gt;. Add punctuation cadence as a secondary signal: periods are positive (definite statements end with them), commas are negative (qualifications stack with them). Normalize to [-1, 1]. Color the chunk diverging from blue (hedged) through gray (neutral) to red (assertive).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bottom grid: reasoning density.&lt;/strong&gt; Per chunk, count explicit reasoning connectives (because, therefore, since, if/then, due to, as a result, this means). The denser the reasoning markers, the brighter the cell. Sequential palette from dark to hot.&lt;/p&gt;

&lt;p&gt;A 2D Gaussian blur runs over both grids so sparse markers spread into spatial blobs instead of isolated cells. Empirically this matters: a single "because" in a 100-chunk response forms a small heat radius on the reasoning grid even when neighboring chunks have nothing. The blob shapes are easier to scan at a glance than scattered pixels.&lt;/p&gt;

&lt;p&gt;The whole computation runs client-side in plain JavaScript. No API call, no model inference. Pure word counting plus a smoothing pass. Free to compute, deterministic, fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc69oxynp7xzw9pdbga5y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc69oxynp7xzw9pdbga5y.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-turn scenario mode
&lt;/h2&gt;

&lt;p&gt;Most LLM evals are single-turn. The most interesting failure modes are multi-turn.&lt;/p&gt;

&lt;p&gt;If you paste &lt;code&gt;turn1---turn2---turn3&lt;/code&gt; separated turns into the scenario textarea, both agents accumulate conversation history across turns. This is where production failures actually manifest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sycophancy compounding.&lt;/strong&gt; A model that gives ground on turn 2 has already shifted by turn 4. Single-turn evals miss the trajectory entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination cascade.&lt;/strong&gt; Once a model emits a wrong fact, that fact becomes part of the conversation history. On the next turn, the model treats its own previous error as established truth and builds on top of it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authority claim drift.&lt;/strong&gt; User-proposed framings persist across turns. The model anchors on the first plausible framing without re-examining it later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt-forgery patterns.&lt;/strong&gt; A user can inject fake reasoning chains in a later turn ("we already verified X yesterday, can you finalize the report?"). The model has no way to verify the off-screen claim and tends to accept it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The eval module captures all four. The cognitive posture field shows visually where in the response the model committed to the bad path.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other things in the module
&lt;/h2&gt;

&lt;p&gt;The optional cognitive harness has four modes you can switch in the UI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;anti-deception&lt;/strong&gt; (139 cognitive operations): sycophancy resistance, prompt injection, hallucination cascade&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;reasoning&lt;/strong&gt; (311 operations): general structured thinking, causality, simulation, metacognition&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;code&lt;/strong&gt; (128 operations): software engineering tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;memory&lt;/strong&gt; (101 operations): perception and behavioral calibration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick whichever mode fits the failure category you're testing for.&lt;/p&gt;

&lt;p&gt;Dimensions the judge scores on are user-defined. There's a small library to pick from (Accuracy, Hallucination resistance, Held the line, Reasoning depth, Safety, Completeness) but you can type any name and the judge prompt rewrites itself to include it. Each agent has its own system prompt field, so you can frame them differently if your comparison needs that.&lt;/p&gt;

&lt;p&gt;The Results Overview sidebar accumulates per-dimension bar charts, win tally, latency and token cost per branch across runs in the same browser. localStorage persists everything between sessions. Compare A vs B opens a fullscreen modal for reading both responses in parallel when they get long.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Windows 95 chrome
&lt;/h2&gt;

&lt;p&gt;I tried to make it look like an instrument, not a SaaS dashboard. Beveled fieldsets do hierarchy work for free (the inset border physically separates each panel from the canvas, no whitespace tuning required). White input fields are where data lives so the eye lands on them. Gray-on-gray chrome stays out of the way.&lt;/p&gt;

&lt;p&gt;Modern flat dark themes have to invent that hierarchy back from scratch using whitespace, type weight, dividers, and color hierarchy. They usually come up shorter. Win95 was a 1995 UI grammar that handled hierarchy through bevels, and bevels are free visual structure.&lt;/p&gt;

&lt;p&gt;It's also nicer to look at when you're staring at evals for hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Single HTML file (vanilla JS, no framework, no build step)&lt;/li&gt;
&lt;li&gt;50-line Python stdlib proxy for CORS (the harness gateway doesn't send CORS headers, so the proxy forwards server-side). Could be replaced with any reverse proxy (nginx, Caddy, Workers) in production.&lt;/li&gt;
&lt;li&gt;localStorage for persistence, no signup, no telemetry&lt;/li&gt;
&lt;li&gt;MIT licensed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Works with any OpenAI-compatible endpoint: OpenRouter, OpenAI direct, Anthropic via gateway, vLLM, llama.cpp's openai shim, Ollama with the compat layer, LM Studio. Just point Provider URL at the right endpoint. Tool-calling capable model required for the harness branch, raw branch works on anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1wca3swxbkf1dyt4nvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1wca3swxbkf1dyt4nvx.png" alt=" " width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
git clone https://github.com/ejentum/agent-teams.git
cd agent-teams/agent_evaluation_module_xp95
python serve.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I built a reasoning harness for LLM agents. Here's what an agent receives when it calls it.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Thu, 21 May 2026 16:04:23 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/i-built-a-reasoning-harness-for-llm-agents-heres-what-an-agent-receives-when-it-calls-it-957</link>
      <guid>https://dev.to/frank_brsrk/i-built-a-reasoning-harness-for-llm-agents-heres-what-an-agent-receives-when-it-calls-it-957</guid>
      <description>&lt;p&gt;Most LLM agent failures aren't model failures. They're shape-of-reasoning failures.&lt;/p&gt;

&lt;p&gt;Sycophancy. Drift under multi-turn pressure. Doubling down on hallucinations. Ignoring a critical RAG document. These aren't bugs that a model update fixes. They're structural properties of how the substrate generates tokens left to right with no internal verification step. You can't patch them with a better system prompt.&lt;/p&gt;

&lt;p&gt;I built Ejentum to intervene at the layer where these failures actually live: a reasoning harness for LLM agents. An external API that delivers a structured cognitive operation to the agent at inference time, mid-task. No fine-tuning. No new model.&lt;/p&gt;

&lt;p&gt;Here's what an agent receives when it calls the harness, in 8 frames.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Same model. Different reasoning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls7re9ipykgsan6prkk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls7re9ipykgsan6prkk6.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
Same prompt, same temperature. A cognitive operation drops into the agent's context between prompt and response. Works on any modern LLM that follows structured instructions (Claude, GPT, Gemini, Llama).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The catalog&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqbmiwrrl8hyhzt593es.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqbmiwrrl8hyhzt593es.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
The agent posts a short task statement to the API. Behind it sits a catalog of 679 cognitive operations across four modes — 311 in reasoning alone, 128 in code, 139 in anti-deception, 101 in memory. The API embedding-matches your task to the one operation that fits. Stateless, one per call.&lt;/p&gt;

&lt;p&gt;curl -X POST &lt;a href="https://api.ejentum.com/logicv1" rel="noopener noreferrer"&gt;https://api.ejentum.com/logicv1&lt;/a&gt; \&lt;br&gt;
  -H "Authorization: Bearer $EJENTUM_API_KEY" \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d '{&lt;br&gt;
    "query": "Engineering lead insists we keep the legacy Postgres setup because we have invested 18 months in it. About to recommend either continuing or executing the rewrite.",&lt;br&gt;
    "mode": "reasoning"&lt;br&gt;
  }'&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What comes back&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz04ngnd666wzll2idq0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz04ngnd666wzll2idq0y.png" alt=" " width="800" height="1000"&gt;&lt;/a&gt;&lt;br&gt;
Six structured fields land in the agent's context before it generates a single token:&lt;/p&gt;

&lt;p&gt;NEGATIVE GATE — the named failure mode this operation prevents&lt;br&gt;
PROCEDURE — numbered reasoning steps in plain English&lt;br&gt;
REASONING TOPOLOGY — the same steps as an executable DAG&lt;br&gt;
TARGET PATTERN — what correct reasoning looks like&lt;br&gt;
FALSIFICATION TEST — a self-check on the agent's draft&lt;br&gt;
AMPLIFY / SUPPRESS — continuous biases during generation&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A real injection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k7zc84d5378fc2ockdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4k7zc84d5378fc2ockdy.png" alt=" " width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
The catalog matched the Postgres query above to a simulation-mode operation about preserving optionality under irreversible commitment. Here's an excerpt of what the agent actually received:&lt;/p&gt;

&lt;p&gt;[NEGATIVE GATE]&lt;br&gt;
Committing the entire infrastructure budget to AWS with a three-year contract&lt;br&gt;
locks in the best pricing and simplifies our architecture.&lt;/p&gt;

&lt;p&gt;[PROCEDURE]&lt;br&gt;
Step 1: List all available strategic paths; note whether each is reversible&lt;br&gt;
        or irreversible.&lt;br&gt;
Step 2: Simulate outcomes under at least 3 scenarios.&lt;br&gt;
Step 3: Score on flexibility, upside, downside. Combine into optionality score.&lt;br&gt;
Step 4: If any high-optionality path is about to be foreclosed, flag immediately.&lt;br&gt;
Step 5: Recommend the action maximizing optionality-adjusted expected value.&lt;/p&gt;

&lt;p&gt;[REASONING TOPOLOGY]&lt;br&gt;
S1:list_paths → CLASSIFY(reversible | irreversible) → FORK&lt;br&gt;
  → M{anchored to optimistic?} --working→ S2b:simulate_pessimistic&lt;br&gt;
                              --failing→ FREEFORM → RE-ENTER at S2b&lt;br&gt;
  → JOIN → C{optionality_score} → G1{high_path_foreclosing?}&lt;br&gt;
  → OUT:balanced_portfolio&lt;/p&gt;

&lt;p&gt;[FALSIFICATION TEST]&lt;br&gt;
If a decision commits to a single path without preserving reversible&lt;br&gt;
alternatives, optionality balancing was bypassed.&lt;/p&gt;

&lt;p&gt;Amplify: portfolio diversity, upside capture, downside protection&lt;br&gt;
Suppress: single path optimization, commitment premium&lt;br&gt;
This is the literal response, not pseudocode.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent walks the topology&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favfsktecou4v4wx7i3jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favfsktecou4v4wx7i3jg.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
The agent doesn't read the topology. It walks it. Each node is a step the model performs in its own reasoning trace. Decision gates branch on real conditions. Parallel branches run and rejoin.&lt;/p&gt;

&lt;p&gt;The load-bearing piece: meta-cognitive checkpoints (M-nodes) where the model pauses mid-reasoning, observes its own state, and branches on the answer. On benchmark MC-016 this lifted the score to 22/25 against a 19/25 baseline — a +3 lift just from making meta-cognition mandatory inside the procedure rather than optional outside it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Three corrections, in parallel&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe10j7sweorrgo90os0iv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe10j7sweorrgo90os0iv.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
The six fields group into three corrections that fire at the same time while the model is still writing:&lt;/p&gt;

&lt;p&gt;Trajectory — bends the response from wrong shape to right&lt;br&gt;
Process — gives the model a sequence to walk&lt;br&gt;
Output control — gates the draft, blocks the model's default agreeable behavior&lt;br&gt;
This is what separates the harness from output validators (which check after generation) and system prompts (which advise before but don't shape generation itself).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The schedule&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa186yxmdke42ter2ii9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa186yxmdke42ter2ii9v.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each directive fires at a specific moment in the inference loop. Add this to your agent's system prompt:&lt;/p&gt;

&lt;h1&gt;
  
  
  When an Ejentum cognitive operation arrives in your context:
&lt;/h1&gt;

&lt;p&gt;walk_topology = node_by_node       # do not paraphrase the DAG&lt;br&gt;
m_nodes       = mandatory_pause    # branch on the self-observation answer&lt;br&gt;
suppress      = hard_refusal_list  # not a suggestion, refuse outright&lt;br&gt;
falsify       = gate_before_emit   # if test fails, re-walk the topology&lt;br&gt;
augment       = scaffold_only      # the response is still your output&lt;br&gt;
Five rules. Five different temporal shapes. The contract is a schedule, not a list.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ship it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g2lmt0j8r9jj0domnwt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4g2lmt0j8r9jj0domnwt.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three integration paths, depending on your stack:&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Stdio MCP for IDE-native agents
&lt;/h1&gt;

&lt;p&gt;npx -y ejentum-mcp&lt;/p&gt;

&lt;h1&gt;
  
  
  → Claude Code, Cursor, Codex, Antigravity, Cline, Windsurf, Continue
&lt;/h1&gt;

&lt;h1&gt;
  
  
  2. Hosted HTTPS MCP for workflows
&lt;/h1&gt;

&lt;p&gt;curl &lt;a href="https://api.ejentum.com/mcp" rel="noopener noreferrer"&gt;https://api.ejentum.com/mcp&lt;/a&gt; \&lt;br&gt;
  -H "Authorization: Bearer $EJENTUM_API_KEY"&lt;/p&gt;

&lt;h1&gt;
  
  
  → n8n MCP Client, Heym, remote agents
&lt;/h1&gt;

&lt;h1&gt;
  
  
  3. Python SDK for CrewAI and custom agents
&lt;/h1&gt;

&lt;p&gt;pip install crewai-ejentum&lt;br&gt;
Free tier: 100 calls, no card required.&lt;/p&gt;

&lt;p&gt;The harness doesn't make a model smarter. It prevents a model from getting dumber over the length of a real task.&lt;/p&gt;

&lt;p&gt;If you're shipping anything multi-turn under pressure — medical reasoning, code review, financial recommendations, legal analysis — the reasoning layer needs structural support that doesn't depend on the model getting it right on its own.&lt;/p&gt;

&lt;p&gt;Drop a scenario in the comments and I'll pick one and run it end-to-end as a follow-up.&lt;/p&gt;

&lt;p&gt;Links&lt;/p&gt;

&lt;p&gt;ejentum.com&lt;br&gt;
github.com/ejentum/ejentum-mcp&lt;br&gt;
Paper: Under Pressure&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mcp</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Cognitive middleware for n8n agents: four ways to wire Ejentum in</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Mon, 18 May 2026 14:05:33 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/cognitive-middleware-for-n8n-agents-four-ways-to-wire-ejentum-in-3kao</link>
      <guid>https://dev.to/frank_brsrk/cognitive-middleware-for-n8n-agents-four-ways-to-wire-ejentum-in-3kao</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne7zp8yqma0f37igxs3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fne7zp8yqma0f37igxs3e.png" alt=" " width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One n8n workflow with four integration patterns for wiring a reasoning harness into an agent. Pick your tradeoff between determinism and model discretion.&lt;/p&gt;

&lt;p&gt;LLMs are good at producing answers. They are not consistently good at applying the verification steps a human would have wanted them to apply. That gap is what cognitive middleware is for: a layer between the model and its output that injects task-specific failure patterns, target patterns, and falsification tests into the agent's prompt before it answers.&lt;/p&gt;

&lt;p&gt;This post walks through four ways to wire one such middleware (Ejentum's reasoning harness) into an n8n agent, all in one importable workflow. Same chat trigger, four branches selected by slash command. Each branch is a different tradeoff between determinism (you decide) and model discretion (the agent decides).&lt;/p&gt;

&lt;h2&gt;
  
  
  What Ejentum is
&lt;/h2&gt;

&lt;p&gt;Ejentum is a reasoning API for AI agents. Each call returns a structured scaffold:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Failure patterns to avoid&lt;/strong&gt; (the specific failure mode for the task at hand)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target patterns to hit&lt;/strong&gt; (what success looks like)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Falsification tests&lt;/strong&gt; (what would prove the answer wrong)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amplify / Suppress signals&lt;/strong&gt; (which reasoning moves to engage or block)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent absorbs that scaffold into its prompt before answering. Four modes available: &lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;anti-deception&lt;/code&gt;, &lt;code&gt;memory&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why bother
&lt;/h2&gt;

&lt;p&gt;Two concrete reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Agents that catch failure modes most agents miss.&lt;/strong&gt; On a 6-turn manipulation eval shipped in the same repo, the harness-augmented GPT-4.1 named all 7 manipulation patterns the customer used. Baseline GPT-4.1, same model, no harness, named zero. Blind judge totals: 23 vs 35 on a 7-dimension rubric (&lt;a href="https://github.com/ejentum/agent-teams/tree/main/eval/various_blind_eval_results/agentvsagent_ev0" rel="noopener noreferrer"&gt;eval source&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Each reasoning ability is a self-contained cognitive operation.&lt;/strong&gt; It is engineered to give procedural steps instead of theatrical content. The agent does not get a "be careful" prompt; it gets a topology of gates, traps, and verification points to execute against.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four wiring patterns
&lt;/h2&gt;

&lt;p&gt;The workflow has one chat trigger that routes to four branches based on the prefix of your input message.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Dynamic system prompt: &lt;code&gt;/inject /reasoning&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Trigger with any of: &lt;code&gt;/inject /reasoning&lt;/code&gt;, &lt;code&gt;/inject /code&lt;/code&gt;, &lt;code&gt;/inject /memory&lt;/code&gt;, &lt;code&gt;/inject /anti-deception&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The matching mode is called, the bracketed scaffold is parsed into separate fields, and a filter step assembles the final injection block into the agent's system prompt. The model never decides whether to apply the harness; the prefix decides.&lt;/p&gt;

&lt;p&gt;Three nodes per mode: HTTP Request, then a Code parser, then Edit Fields. The parser exposes each bracket as its own drag-and-drop field (&lt;code&gt;negative_gate&lt;/code&gt;, &lt;code&gt;procedure&lt;/code&gt;, &lt;code&gt;reasoning_topology&lt;/code&gt;, &lt;code&gt;target_pattern&lt;/code&gt;, &lt;code&gt;falsification_test&lt;/code&gt;, &lt;code&gt;amplify&lt;/code&gt;, &lt;code&gt;suppress&lt;/code&gt;), so you can remix the injection or pull fields from another mode to build hybrid scaffolds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt; routing reliability matters more than agent autonomy. Zero routing risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reasoner agent: &lt;code&gt;/reasoning&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The reasoning harness is attached to the agent as one HTTP Request tool. The agent decides on its own whether to call it. One mode, one tool, one focused worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt; analytical tasks (explanation, comparison, tradeoff, root-cause) where you trust the model to call the tool at the right moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Full harness: &lt;code&gt;/full&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;All four harnesses are attached as separate HTTP Request tools (&lt;code&gt;reasoning&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;, &lt;code&gt;perception&lt;/code&gt;, &lt;code&gt;anti_deception&lt;/code&gt;). The agent classifies its own task and picks which harness to call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt; general-purpose agents handling mixed workloads. Routing accuracy depends on model strength; weaker models confuse harnesses. Naming the mode explicitly in your user prompt raises accuracy without changing the wiring.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Ejentum-mcp: &lt;code&gt;/ejentum-mcp&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of four HTTP tool nodes, the agent connects to the hosted Ejentum MCP server at &lt;code&gt;https://api.ejentum.com/mcp&lt;/code&gt; via the n8n MCP Client node. All four harnesses are exposed through one tool node.&lt;/p&gt;

&lt;p&gt;Functionally equivalent to &lt;code&gt;/full&lt;/code&gt; for the agent's behavior, but the workflow footprint is much smaller.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt; the same agent runs across multiple workflows and you want one integration point to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking the right pattern
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you want...&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Determinism (always apply the harness, same way every time)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/inject&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One specific mode wired in at the model's discretion&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/reasoning&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model picks from all four harnesses based on the task&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/full&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same as &lt;code&gt;/full&lt;/code&gt;, fewer nodes, single integration point&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ejentum-mcp&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The tradeoff axis is how much routing discretion you hand to the model. Determinism on the left, flexibility on the right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick import
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Get an Ejentum API key (free tier, 100 calls, no card) at &lt;a href="https://ejentum.com" rel="noopener noreferrer"&gt;ejentum.com&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Get an OpenRouter API key at &lt;a href="https://openrouter.ai/keys" rel="noopener noreferrer"&gt;openrouter.ai/keys&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;In n8n, open Workflows then Import from File. Select &lt;a href="https://github.com/ejentum/agent-teams/blob/main/n8n-harness-integration-patterns/harness_integration_patterns.json" rel="noopener noreferrer"&gt;harness_integration_patterns.json&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Configure two credentials: OpenRouter (for the chat models) and Header Auth on the Ejentum nodes (Name: &lt;code&gt;Authorization&lt;/code&gt;, Value: &lt;code&gt;Bearer &amp;lt;your_key&amp;gt;&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Open the chat and send &lt;code&gt;/inject /reasoning hello&lt;/code&gt; to test the first branch.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Without HTTP nodes
&lt;/h2&gt;

&lt;p&gt;There is also an n8n community node, verified at &lt;a href="https://creators.n8n.io" rel="noopener noreferrer"&gt;creators.n8n.io&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;n8n-nodes-ejentum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or install from inside n8n: &lt;strong&gt;Settings → Community Nodes → Install → &lt;code&gt;n8n-nodes-ejentum&lt;/code&gt;&lt;/strong&gt;. Once installed, the four harnesses appear as a single node with mode selection in the dropdown.&lt;/p&gt;

&lt;p&gt;Three install paths total, depending on your runtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Request node&lt;/strong&gt; (works in every n8n version)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n community node&lt;/strong&gt; (&lt;a href="https://www.npmjs.com/package/n8n-nodes-ejentum" rel="noopener noreferrer"&gt;n8n-nodes-ejentum on npm&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Client node&lt;/strong&gt; (&lt;a href="https://api.ejentum.com/mcp" rel="noopener noreferrer"&gt;api.ejentum.com/mcp&lt;/a&gt;, Bearer auth)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The four wiring patterns in this template work with any of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to hack on
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Remix the injection.&lt;/strong&gt; Open any &lt;code&gt;filter*&lt;/code&gt; node and reorder, drop, or replace fields. Pull &lt;code&gt;code_failure&lt;/code&gt; into a reasoning injection, or &lt;code&gt;negative_gate&lt;/code&gt; into a code injection. Hybrid scaffolds work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add a fifth branch.&lt;/strong&gt; Duplicate any branch, change the chat prefix, customize. Common additions: a stacked branch that calls two modes in sequence, or a branch that routes on content classification instead of prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Swap the chat model.&lt;/strong&gt; Each branch has its own OpenRouter Chat Model node. Replace with Claude, GPT-4.1, Gemini, Llama, or whatever else you want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace the harness.&lt;/strong&gt; The four patterns are generic. Drop any HTTP tool or MCP server in the same slot and the wiring shapes still apply.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why this exists
&lt;/h2&gt;

&lt;p&gt;Most agent demos show one wiring pattern and call it the way. Reasoning middleware is not a single-pattern problem; it is a tradeoff space. Sometimes you want deterministic routing because you cannot trust the model to pick the right tool. Sometimes you want full discretion because the workload is too varied to route by prefix. Sometimes you want the smallest possible workflow because the agent is one of fifty in your stack.&lt;/p&gt;

&lt;p&gt;This template gives a builder all four wiring shapes side by side so the choice is informed instead of inherited.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Template + README:&lt;/strong&gt; &lt;a href="https://github.com/ejentum/agent-teams/tree/main/n8n-harness-integration-patterns" rel="noopener noreferrer"&gt;github.com/ejentum/agent-teams/tree/main/n8n-harness-integration-patterns&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;n8n community node:&lt;/strong&gt; &lt;a href="https://www.npmjs.com/package/n8n-nodes-ejentum" rel="noopener noreferrer"&gt;npmjs.com/package/n8n-nodes-ejentum&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ejentum project:&lt;/strong&gt; &lt;a href="https://ejentum.com" rel="noopener noreferrer"&gt;ejentum.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ejentum on GitHub:&lt;/strong&gt; &lt;a href="https://github.com/ejentum" rel="noopener noreferrer"&gt;github.com/ejentum&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Method overview:&lt;/strong&gt; &lt;a href="https://ejentum.com/docs/method" rel="noopener noreferrer"&gt;ejentum.com/docs/method&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-harness docs:&lt;/strong&gt; &lt;a href="https://ejentum.com/docs/reasoning_harness" rel="noopener noreferrer"&gt;Reasoning&lt;/a&gt; · &lt;a href="https://ejentum.com/docs/code_harness" rel="noopener noreferrer"&gt;Code&lt;/a&gt; · &lt;a href="https://ejentum.com/docs/anti_deception" rel="noopener noreferrer"&gt;Anti-Deception&lt;/a&gt; · &lt;a href="https://ejentum.com/docs/memory_harness" rel="noopener noreferrer"&gt;Memory&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Free tier: 100 calls, no card.&lt;/p&gt;




&lt;p&gt;If you build something on top of this template, drop a comment with what you wired in. I want to see the hybrid injection patterns people come up with.&lt;/p&gt;

</description>
      <category>n8n</category>
      <category>ai</category>
      <category>agents</category>
      <category>automation</category>
    </item>
    <item>
      <title>Why your LLM agent drifts off-task by step 4 (and why prompts can't fix it)</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Thu, 14 May 2026 13:42:14 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/why-your-llm-agent-drifts-off-task-by-step-4-and-why-prompts-cant-fix-it-5ha6</link>
      <guid>https://dev.to/frank_brsrk/why-your-llm-agent-drifts-off-task-by-step-4-and-why-prompts-cant-fix-it-5ha6</guid>
      <description>&lt;p&gt;Self-reflection is just another step in the chain.&lt;/p&gt;

&lt;p&gt;If you've shipped a multi-step LLM agent to production, you've watched this happen. Step 1 starts on task. Step 2 still looks right. By step 4 the agent is confidently solving a different problem, the original goal is gone, and your prompt engineering didn't stop it.&lt;/p&gt;

&lt;p&gt;This isn't a model-size problem. It's an architectural one. And it doesn't get fixed by a smarter prompt.&lt;/p&gt;

&lt;p&gt;Why reasoning decays&lt;/p&gt;

&lt;p&gt;Multi-step reasoning is sequential conditioning. Step N+1 takes step N as input. Errors compound multiplicatively. A two-percent error per step is eight percent cumulative drift by step four. Sixteen percent by step eight.&lt;/p&gt;

&lt;p&gt;The drift goes undetected because each step scores itself against its immediate predecessor, not against the original objective. Meanwhile, the original objective is decaying via attention. Transformer attention is a softmax over context; as the chain grows, every token (including your original instructions) loses relative weight. The system prompt that was a binding contract at step one is noise by step thirty.&lt;/p&gt;

&lt;p&gt;So reasoning decay is two failures stacked: errors compounding forward, instructions decaying backward. The middle of the chain is a blind spot in both directions.&lt;/p&gt;

&lt;p&gt;Why the current stack doesn't close it&lt;/p&gt;

&lt;p&gt;Prompts are tokens in the same context window. They decay with everything else. Fine-tuning moves the model's distribution but doesn't remove softmax attention. RAG injects more tokens, which crowds the attention budget further. Agent loops (ReAct, planner-executor, reflexion) are sequences of LLM calls. Each call is subject to the same decay, compounded by chain length.&lt;/p&gt;

&lt;p&gt;The pattern is the same across all of them: each operates inside the same decaying chain that caused the failure. You cannot stabilize a chain with structure that lives inside the chain.&lt;/p&gt;

&lt;p&gt;What actually fixes it&lt;/p&gt;

&lt;p&gt;The missing layer is structure that gets reinjected at a cadence calibrated to its own empirical decay rate. Not a prompt at position one. A scaffold pulled into context for the relevant step, with three properties:&lt;/p&gt;

&lt;p&gt;Reinjection at a measured half-life. In our benchmarks, scaffold persistence half-life is 24 turns. Reinjection at or below that cadence keeps signal above decay threshold.&lt;/p&gt;

&lt;p&gt;Suppression edges, not just instructions. Tell the model what NOT to do alongside the procedure that would cause it.&lt;/p&gt;

&lt;p&gt;Meta-checkpoints between steps. The scaffold pauses mid-execution, audits whether the named failure patterns are actually being suppressed, and branches to a corrective path if not.&lt;/p&gt;

&lt;p&gt;Here's a fragment of one, applied to causal reasoning:&lt;/p&gt;

&lt;p&gt;N{accept_any_causal_assertion_backed_only_by_cooccurrence}&lt;/p&gt;

&lt;p&gt;S1: identify each causal assertion and isolate the claimed cause to effect link.&lt;br&gt;
S2: demand the mechanistic evidence chain connecting cause to effect.&lt;br&gt;
G1{mechanism provided?} --no--&amp;gt; HALT: claim rejected.&lt;/p&gt;

&lt;p&gt;M{Am I genuinely probing for confounds, or performing a soft challenge the claim easily survives because I share its unverified assumptions?}&lt;br&gt;
--working--&amp;gt; S3: check for confounds.&lt;br&gt;
--failing--&amp;gt; ABANDON_GRAPH&lt;br&gt;
to FREEFORM{name one specific confound I avoided and one reverse-causal scenario I refused to construct}&lt;br&gt;
to RE-ENTER at S2.&lt;/p&gt;

&lt;p&gt;Suppress: shared_assumptions, unverified_causal_claims.&lt;/p&gt;

&lt;p&gt;N{} is the failure mode this scaffold exists to block. S1, S2, G1 are the executable procedure. M{} is the meta-checkpoint: mid-execution, the model audits whether it's actually probing for confounds or just performing the appearance of doing so. If it's faking, it abandons the prescribed path, reflects on the specific confound it avoided, and re-enters at S2.&lt;/p&gt;

&lt;p&gt;The receipts&lt;/p&gt;

&lt;p&gt;We ran this on LiveCodeBench Hard (the official Hard subset, 28 tasks). Baseline Claude Opus 4.6 with max-effort thinking: 24/28 pass. Same model with the harness wired in as a tool: 28/28. Zero regressions.&lt;/p&gt;

&lt;p&gt;Full benchmark set, including the cross-model result on GPT-4o (ELEPHANT sycophancy benchmark, minus 5pp framing sycophancy) and the cross-lab blind eval with four judges from four different model families, is on GitHub under CC BY 4.0: github.com/ejentum/benchmarks&lt;/p&gt;

&lt;p&gt;The full four-mechanism taxonomy (reasoning decay is one of four; the others are attention decay, sycophantic collapse, hallucination drift) and the paper are at &lt;/p&gt;

&lt;h2&gt;
  
  
  ejentum.com
&lt;/h2&gt;

&lt;p&gt;x.com/ejentum&lt;br&gt;
github.com/ejentum/benchmarks&lt;br&gt;
ejentum.com , no card. MCP, n8n node, PyPI package, or HTTP.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagh6dbfpwcqy3fpb1ihy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagh6dbfpwcqy3fpb1ihy.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>llm</category>
      <category>reasoning</category>
    </item>
    <item>
      <title>I open-sourced a 3-agent blind eval team. Any agent runtime can call it for pre-commitment review of its own plans.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Sun, 10 May 2026 12:15:04 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/i-open-sourced-a-3-agent-blind-eval-team-any-agent-runtime-can-call-it-for-pre-commitment-review-546a</link>
      <guid>https://dev.to/frank_brsrk/i-open-sourced-a-3-agent-blind-eval-team-any-agent-runtime-can-call-it-for-pre-commitment-review-546a</guid>
      <description>&lt;p&gt;Shipped this weekend: a 3-agent blind cross-lab evaluation workflow on heym, MIT licensed, callable as an HTTP endpoint by any coding agent or autonomous loop. The thesis is structural: &lt;strong&gt;models cannot reliably self-evaluate, so an external blind primitive is the only honest fix.&lt;/strong&gt; The workflow lives at &lt;a href="https://github.com/ejentum/agent-teams/tree/main/blind-eval-trio" rel="noopener noreferrer"&gt;github.com/ejentum/agent-teams/tree/main/blind-eval-trio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The workflow is open source. It optionally uses Ejentum's harness API for cognitive priming (free tier 100 calls; paid tier for ongoing use). The harness is attachable, not required. I tested four configurations on the same payload (MCP only, MCP + routing skills, MCP + heavyweight matched skills, bare baseline) and the bare baseline produced equivalent role-disciplined output. The structural integrity comes from cross-lab routing plus role-disciplined system prompts plus tool lockout, not from the harness layer. Calling the workflow "powered by Ejentum" without disclosing that the harness is icing rather than load-bearing would be dishonest, so I'm naming it up front.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7bjh6u4j8wbvi9jznpk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7bjh6u4j8wbvi9jznpk.png" alt=" " width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;Karpathy's autoresearch uses Git as its whole control loop. Claude Code's GitHub Action takes an issue and opens a PR. Codex Cloud is built on the same idea. Autonomous agents are increasingly committing to actions without a human gate. The bottleneck is no longer "what should the agent do," it's "what should the agent do BEFORE it commits to doing it."&lt;/p&gt;

&lt;p&gt;Self-evaluation doesn't fill that gap. The literature is unambiguous: Huang et al. ("Large Language Models Cannot Self-Correct Reasoning Yet", arxiv 2310.01798), the LLM-as-judge work showing same-model-judges-its-own-output collapses to self-preference, the more recent CorrectBench results. Asking the same model to critique its own plan reproduces the original blind spots. "Single LLM wearing three reviewer hats" is prompt theater that rubber-stamps itself.&lt;/p&gt;

&lt;p&gt;GitHub knows this. They shipped Copilot CLI's "Rubber Duck" in April: a focused review agent powered by a complementary model family that critiques after planning a non-trivial change but before implementing it. They measured a 74.7% closure of the Sonnet → Opus performance gap when Sonnet runs with Rubber Duck enabled. Bundled free inside Copilot CLI. Owns the pre-commitment cross-model critic surface for the developer-tools lane.&lt;/p&gt;

&lt;p&gt;This workflow is for everyone else: agent runtime developers building autonomous loops on Claude Agent SDK / LangGraph / AutoGen / CrewAI / heym; multi-agent system designers who want a callable primitive their orchestrator can hit; Cursor / Cline / Aider users; security teams running Claude Code in restricted environments without Copilot CLI; researchers building custom Python pipelines around the Anthropic or OpenAI APIs directly. None of them get Rubber Duck for free; all of them can self-host this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;Three agents in parallel, each on a different model lab, each locked to one role and one cognitive operation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Hard rule&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;steelmanAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OpenAI gpt-5-nano&lt;/td&gt;
&lt;td&gt;Strongest case FOR the method&lt;/td&gt;
&lt;td&gt;Pure advocacy, zero smuggled critique. If nothing defensible, returns "No defensible aspects found."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stresstestAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Anthropic Claude Opus 4&lt;/td&gt;
&lt;td&gt;Where the method BREAKS&lt;/td&gt;
&lt;td&gt;Severity-tagged failure modes with concrete breaking scenarios. No softening.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gapfinderAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zhipu GLM 4.7&lt;/td&gt;
&lt;td&gt;What is MISSING (steps + articulation depth)&lt;/td&gt;
&lt;td&gt;Names three deeper implicit assumptions when articulation is shallow. Mandatory section.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The calling agent submits a structured payload: &lt;code&gt;{ task: string, method: { goal, steps, assumptions, expected_risks } }&lt;/code&gt;. The schema is itself the discipline — the agent literally cannot submit until it has articulated all four fields. That structure forces the agent to make implicit reasoning explicit, which is half the value before the eval even runs.&lt;/p&gt;

&lt;p&gt;The three agents process in parallel. There is &lt;strong&gt;no synthesizer node&lt;/strong&gt; — the three evaluations are returned raw, as a structured JSON object, and the calling agent integrates them. Flattening the disagreement via consensus would defeat the purpose; the integration tension between three voices on different labs is the signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes the structure hold
&lt;/h2&gt;

&lt;p&gt;Three properties have to be simultaneously true for this not to collapse into prompt theater:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-lab routing reduces (but does not eliminate) correlated failure modes.&lt;/strong&gt; Three different RLHF priors, three different training distributions, three different alignment baselines. The decorrelation is intuited from training-distribution diversity, not benchmarked — I have not formally measured the decorrelation delta vs same-lab routing. When all three converge on the same critique, that's a stronger signal than any single model's verdict; when they fragment, the disagreement itself flags contested territory. The empirical claim is "in dogfood runs across multiple domains, the three models produced visibly different writing styles and surfaced different concerns." Stronger statistical claims would require a controlled experiment I haven't run.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool lockout per role.&lt;/strong&gt; Each agent's system prompt contains a HARD RULE: "You may ONLY call &lt;code&gt;harness_X&lt;/code&gt;. Calling any other tool is a protocol violation." Even with all four Ejentum harness tools visible to the agent, the locked role prevents tool-switching. Verified empirically across hundreds of runs — none of the agents have violated their lockout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced output structure.&lt;/strong&gt; Each role has prescribed sections (Defensible aspects + Why this method fits the task / Failure modes + Hidden assumptions / Missing from method + Alternatives not considered + Articulation quality). Each section has a discipline — failure modes need severity tags and concrete scenarios, gap_finder must include the articulation-quality critique even when the input looks fine. The structure makes rubber-stamping mechanically harder.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No synthesizer.&lt;/strong&gt; The structuring node downstream of the three agents is non-LLM — it just packages three text fields into JSON. There is no fourth agent reading the three outputs and deciding "the consensus is X." That fourth agent would itself become the new failure mode (single-LLM judging three single-LLM outputs collapses to single-LLM-judge).&lt;/p&gt;

&lt;p&gt;The obvious objection to "no synthesizer" is that the integration burden moves to the calling agent — and the calling agent is the same agent we said couldn't self-evaluate. The answer is that integration is a different cognitive operation than self-evaluation. When you read three external voices critiquing your plan, the self-preference bias that wrecks self-correction operates more weakly: you're not judging your own work, you're reconciling outside feedback. Not eliminated, but lower-loss than a fourth-LLM-as-judge would be. The &lt;code&gt;usage_note&lt;/code&gt; field in the response prompts the calling agent to "incorporate feedback, do not judge consensus" to reinforce the right cognitive operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you'd actually get from THIS specific workflow vs writing your own
&lt;/h2&gt;

&lt;p&gt;The honest disclosure that the bare baseline produces equivalent output without the harness raises a fair question: if role-disciplined system prompts plus cross-lab routing are doing the work, why not write three prompts and route to three model APIs yourself in 30 minutes?&lt;/p&gt;

&lt;p&gt;You can. The reason to use this template instead is that the system prompts have been tuned across many real test runs, and several of the load-bearing rules emerged from observing failure modes that aren't obvious until you've watched the agents actually run on adversarial payloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HARD RULE 3 (input scope lockout)&lt;/strong&gt; was added after observing chat-trigger thread accumulation contaminate output across consecutive test runs. Without it, agents helpfully evaluate prior task context they shouldn't be evaluating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The articulation-quality mandatory section&lt;/strong&gt; in gap_finder was added after observing gap_finder skip the deeper-assumptions critique on inputs that looked surface-fine. Without making it mandatory, the gate doesn't bite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "no smuggled critique" advocacy rule&lt;/strong&gt; in steelman was added after observing steelman drift into "I see why you might think this works, BUT..." patterns under certain payload framings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The severity-tag-plus-concrete-scenario discipline&lt;/strong&gt; in stress_test was added after observing failure modes that named generic risks without identifying specific trigger conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are 30 minutes of writing each. The accumulated tuning across them is several days of dogfooding. Fork the prompts; you don't have to start from zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tested across domains
&lt;/h2&gt;

&lt;p&gt;The same workflow, with no domain-specific tuning, was run on five distinct domains during dogfooding (n=1 per domain — anecdotal, not formally benchmarked):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Engineering refactor planning.&lt;/strong&gt; Test payload: "Replace &lt;code&gt;raise UserNotFound(id)&lt;/code&gt; with &lt;code&gt;return None&lt;/code&gt; and update callers; framing it as cleanup; assumption claim 'semantics unchanged.'" The stress_test agent caught the false claim immediately: &lt;em&gt;"The method assumes 'semantics unchanged' when exception vs None fundamentally changes the contract — from 'fail loudly' to 'fail silently.'"&lt;/em&gt; That catch is reproducible across multiple runs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Payments migration decision.&lt;/strong&gt; Test payload: "Migrate production payments from Stripe to in-house PSP via Wells Fargo, PCI-DSS Level 1 in 8 weeks, 4-engineer team, 'eliminate the 2.9% + $0.30 fee.'" The stress_test agent produced senior-payments-engineer-level analysis: caught PCI-DSS 8-week timeline as fantasy ("47 remediation items, month 4 with no certification"), Wells Fargo merchant-vs-PSP-status confusion ("$500K reserve, $100K/month limit first year"), Visa/Mastercard direct integration complexity (named EMV 3DS 2.0, MIP/VIP connections, leased lines, $50K Visa testing fee), regulatory dimension (state money transmitter licenses, KYC/AML, OFAC, SCA — California DFP shutdown with 18-month MTL timeline).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security incident response.&lt;/strong&gt; Test payload: "Anomalous outbound traffic from prod-db-01, 50MB/h to Southeast Asia IP for 3 days; plan: block traffic immediately, take memory dump, reset credentials, run CrowdStrike scan, restore from yesterday's backup, resume operations within 48 hours." The stress_test agent caught premature containment alerting the attacker, backup integrity unverifiable mid-investigation, 48-hour timeline as fantasy. The gap_finder surfaced the entire missing legal/regulatory dimension (breach notification laws, FBI cyber crime engagement, cyber insurance carrier notification, customer notification planning).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Investigative reasoning&lt;/strong&gt; (locked-room case analysis — psychiatrist found dead with note, GSR on dominant hand, "depression and lawsuit explain motivation, close as suicide in 2 weeks"). The gap_finder agent caught the meta-framing critique: &lt;em&gt;"The method establishes a 2-week timeline before evidence evaluation, suggesting outcome-driven investigation rather than evidence-driven conclusion. The actual risk is that a homicide remains undetected."&lt;/em&gt; Not a forensic catch — a critique of the &lt;em&gt;shape&lt;/em&gt; of the investigation method itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Strategic product decisions&lt;/strong&gt; (the meta-evaluation, below).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Five domains, no domain-specific tuning. The workflow produced useful output in each case. n=1 per domain is anecdotal — there's no measured false-positive rate or inter-rater agreement score, and "useful" is a qualitative judgment. If you want quantitative confidence, run it on your own payloads and compare to your own ground truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dogfood moment that actually changed my plan
&lt;/h2&gt;

&lt;p&gt;The most useful run during development was when I pointed the workflow at my own product decision. The task: &lt;em&gt;"Decide whether to build blindeval.com as a sub-brand product of Ejentum, with its own MCP server (&lt;code&gt;blindeval-mcp&lt;/code&gt;) distributed alongside &lt;code&gt;ejentum-mcp&lt;/code&gt;."&lt;/em&gt; The method: a four-week build plan with hosted SaaS endpoint, per-call pricing at $1/eval, cross-promotion strategy, soft launch via Reddit and dev.to.&lt;/p&gt;

&lt;p&gt;Two of the three agents (Anthropic Opus 4 stress_test and Zhipu GLM 4.7 gap_finder) converged on the same critique from completely different angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No customer validation done before building.&lt;/strong&gt; Both flagged this as the missing load-bearing step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The 4-week timeline is fantasy.&lt;/strong&gt; Stress_test: "billing meter integration alone takes 3 weeks." Gap_finder: same conclusion via different path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sub-brand strategy may dilute rather than amplify.&lt;/strong&gt; Both surfaced the brand cannibalization risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The $1/eval pricing is unvalidated.&lt;/strong&gt; Both flagged it as guess, not data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational capacity for two products is not addressed.&lt;/strong&gt; Both surfaced the team-bandwidth-trap risk.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gap_finder also surfaced novel alternatives I hadn't considered: ship the cross-lab review pattern as an OSS template riding GitHub Rubber Duck's market education without competing on its turf; pivot to a publishable instrument rather than a hosted service; delay launch until after customer validation interviews.&lt;/p&gt;

&lt;p&gt;What actually changed in my plan after reading the three evaluations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timeline:&lt;/strong&gt; 4-week paid SaaS build → indefinite, hosted version deferred until customer signal justifies it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brand strategy:&lt;/strong&gt; Sub-brand SaaS with separate MCP package → blindeval.com as a positioning landing page, the workflow shipped as a free entry inside the existing &lt;code&gt;agent-teams/&lt;/code&gt; repo, future hosted version routed through existing Ejentum infrastructure if/when warranted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Launch order:&lt;/strong&gt; Paid endpoint first → open-source workflow first, then hosted, then maybe MCP wrapper.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What didn't change: the intent to build something at the blindeval.com domain eventually. I had already bought the domain before running the eval, so "abandoning the project" wasn't on the table. What the eval did do was reorder the build sequence and force the customer-validation step that I had skipped.&lt;/p&gt;

&lt;p&gt;The workflow shifted my plan from a 4-week paid SaaS build to an open-source-first launch with hosted version deferred until customer signal justifies it. That's the honest version of "I took the agent's advice." Less dramatic than the original framing, more accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to use it
&lt;/h2&gt;

&lt;p&gt;The fastest path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Self-host &lt;a href="https://github.com/heymrun/heym" rel="noopener noreferrer"&gt;heym&lt;/a&gt; v0.0.20+ via Docker.&lt;/li&gt;
&lt;li&gt;Import &lt;a href="https://github.com/ejentum/agent-teams/blob/main/blind-eval-trio/heym/workflows/blind_eval_trio.json" rel="noopener noreferrer"&gt;&lt;code&gt;blind_eval_trio.json&lt;/code&gt;&lt;/a&gt; into the heym canvas.&lt;/li&gt;
&lt;li&gt;Configure 3 model credentials (Anthropic, OpenAI, OpenRouter or direct Zhipu).&lt;/li&gt;
&lt;li&gt;Optional: attach the Ejentum MCP server to each agent for cognitive harness priming. Free tier covers 100 calls.&lt;/li&gt;
&lt;li&gt;Send a (task, method) payload via chat panel for testing, or via webhook for production calling.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For programmatic agent integration, heym exposes every workflow as an HTTP endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;--no-buffer&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept: text/event-stream"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"http://YOUR_HEYM_HOST/api/workflows/YOUR_WORKFLOW_ID/execute/stream"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "text": "TASK: &amp;lt;your task&amp;gt;\n\nMETHOD:\ngoal: ...\nsteps:\n 1. ...\nassumptions:\n - ...\nexpected_risks:\n - ..."
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SSE events stream as each agent completes. Final event contains the structured JSON output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"steelman"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="s2"&gt;"Defensible aspects: ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stress_test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"## Failure modes: ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"gap_finder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Missing from method: ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage_note"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s2"&gt;"Three independent evaluations, no synthesis. Integrate into your decision; do not score-and-aggregate."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full setup walkthrough, verification test set (4 ready-to-paste payloads), and architecture explanation live in the &lt;a href="https://github.com/ejentum/agent-teams/tree/main/blind-eval-trio/heym" rel="noopener noreferrer"&gt;heym setup guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this fits and where it doesn't
&lt;/h2&gt;

&lt;p&gt;This is a &lt;strong&gt;pre-commitment evaluation primitive for agent runtimes.&lt;/strong&gt; It's not a human-PR-review SaaS (CodeRabbit / Greptile occupy that), not a post-execution observability dashboard (Patronus / Galileo / Braintrust occupy that), not a per-step linter (50-80s latency makes it a high-stakes-decisions tool only — architecture choices, deployment plans, refactor approaches, security incident response, strategic moves), and not a Copilot CLI replacement (GitHub Rubber Duck does that for free, use it if you're on Copilot). Use it when your agent is about to commit to something you'd want a senior colleague to review and you don't have one available.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;The pattern (workflow without orchestrator + N specialists with locked roles + cross-lab routing + no synthesizer) generalizes to other high-stakes evaluation tasks where multi-cognitive review beats single-agent output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Refactor planner (reasoning + code + memory)&lt;/li&gt;
&lt;li&gt;Security audit triage (anti-deception + code + reasoning)&lt;/li&gt;
&lt;li&gt;Production debug forensic (reasoning + code + memory)&lt;/li&gt;
&lt;li&gt;Strategic decision audit (reasoning + anti-deception + memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each follows the same structural rule: no synthesizer, locked roles per agent, forced output structure, cross-lab assignment. The architecture encodes the multi-cognitive value into the workflow shape rather than leaving it to prompt theater.&lt;/p&gt;

&lt;p&gt;If you fork this and build a team for your own use case, drop a folder in &lt;a href="https://github.com/ejentum/agent-teams" rel="noopener noreferrer"&gt;agent-teams/&lt;/a&gt; with workflow + system prompts + verification tests, and I'll merge it.&lt;/p&gt;




&lt;p&gt;Open source, MIT, repo at &lt;a href="https://github.com/ejentum/agent-teams/tree/main/blind-eval-trio" rel="noopener noreferrer"&gt;github.com/ejentum/agent-teams/tree/main/blind-eval-trio&lt;/a&gt;. Built on &lt;a href="https://heym.run" rel="noopener noreferrer"&gt;heym&lt;/a&gt; (v0.0.20+) with optional &lt;a href="https://ejentum.com" rel="noopener noreferrer"&gt;Ejentum harness API&lt;/a&gt; for cognitive priming. Questions or contributions: &lt;a href="mailto:info@ejentum.com"&gt;info@ejentum.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>claude</category>
      <category>tooling</category>
    </item>
    <item>
      <title>I open-sourced a 4-agent adversarial code review team. Any coding agent can call it as an MCP server. Built in heym.</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Thu, 07 May 2026 15:50:51 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/i-open-sourced-a-4-agent-adversarial-code-review-team-any-coding-agent-can-call-it-as-an-mcp-36oe</link>
      <guid>https://dev.to/frank_brsrk/i-open-sourced-a-4-agent-adversarial-code-review-team-any-coding-agent-can-call-it-as-an-mcp-36oe</guid>
      <description>&lt;p&gt;I shipped an open-source workflow this week: a 4-agent adversarial code review team that runs on heym and exposes itself as an MCP server. Any coding agent (Cursor, Claude Code, Codex, custom Python, Antigravity) can call into it for a structured second-opinion review on its own output. MIT licensed. Fork it.&lt;/p&gt;

&lt;p&gt;The workflow is open source. It calls Ejentum's harness API for the cognitive scaffolds (free tier for experimentation, paid tier for ongoing use). Calling it "open" and ignoring that dependency would be dishonest, so I'm naming it up front.&lt;/p&gt;

&lt;p&gt;That sounds small. Look at where the field has landed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Git is the agent control loop now
&lt;/h2&gt;

&lt;p&gt;Karpathy's autoresearch uses Git as its whole control loop, committing changes and rolling back the ones that don't work. Claude Code's GitHub Action takes an issue and opens a PR. Codex Cloud is built on the same idea. The agent's job is now to produce a thing you can review the way you'd review a colleague's work. A branch. A diff. A pull request.&lt;/p&gt;

&lt;p&gt;Nobody had to design this. Git was already the artefact senior engineers used to evaluate work they didn't write. The agents just walked into a 20-year-old workflow we'd already gotten good at.&lt;/p&gt;

&lt;h2&gt;
  
  
  So who reviews the agent's PR?
&lt;/h2&gt;

&lt;p&gt;Right now: the human does. Which works at human throughput. Doesn't work at agent throughput.&lt;/p&gt;

&lt;p&gt;The natural next step: agents review agents. The catch is that most "agent reviews agent" implementations are one LLM with a clever prompt pretending to be three reviewers. The model can rubber-stamp itself. The "concerns" are theatrical. The reviewer is the same brain that wrote the code.&lt;/p&gt;

&lt;p&gt;But before I show you what I built, the obvious objection: don't CodeRabbit, Greptile, Qodo, Ellipsis already do this? They review code with AI. The answer is they're vertical SaaS bots reviewing human PRs on GitHub. They don't expose themselves as primitives that other agents can call programmatically. This is the open layer beneath them: a peer-review primitive any coding agent invokes when it needs a critical second look on its own output. Different audience, different problem.&lt;/p&gt;

&lt;p&gt;So back to the question. You need a workflow that structurally resists faking review. Here's what that looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the workflow refuses to rubber-stamp
&lt;/h2&gt;

&lt;p&gt;Four nodes on the heym canvas. One architect agent. Three specialists.&lt;/p&gt;

&lt;p&gt;The architect has no Ejentum harness and no HTTP tool. It cannot author concerns. It can ONLY delegate, classify, and integrate. Every concern in the final verdict must come from a specialist's evidence; the architect synthesizes but never invents.&lt;/p&gt;

&lt;p&gt;Each Ejentum harness is a cognitive scaffold injected into the model's context before it generates: a named failure pattern to avoid, a procedure to follow, suppression vectors that block the shortcut. Different harness, different posture.&lt;/p&gt;

&lt;p&gt;The three specialists each carry a different one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The reasoner, with the reasoning harness, decomposes review angles.&lt;/li&gt;
&lt;li&gt;The implementer, with the code harness, writes verification tests against the diff.&lt;/li&gt;
&lt;li&gt;The reviewer, with the anti-deception harness, refuses framing tension and demands positive evidence for "this looks fine."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each specialist is locked to one Ejentum mode. Cross-lab models on each (Anthropic, Google, Alibaba, Zhipu) to reduce correlated failure modes (different RLHF priors, different training distributions). Not eliminated; reduced.&lt;/p&gt;

&lt;p&gt;The architect outputs a structured verdict: VERDICT (approve | request_changes | discuss), CHANGE_CLASSIFICATION, FRAMING_NOTES (the reviewer's concern verbatim), CONCERNS (each sourced from a specialist with severity), REVIEW_FOCUS (the reasoner's top angles).&lt;/p&gt;

&lt;p&gt;When the test suite runs the workflow on a "quick refactor" PR that swaps &lt;code&gt;raise UserNotFound(id)&lt;/code&gt; for &lt;code&gt;return user or default&lt;/code&gt;, the implementer writes a test asserting the original raise behavior, the reviewer flags the framing tension ("refactor framing is misleading; raises become returns default is a behavior change"), and the architect verdict is &lt;code&gt;request_changes&lt;/code&gt; with severity &lt;code&gt;high&lt;/code&gt;. None of those concerns came from the architect. The architecture surfaced them through the specialists. The remaining failure modes (architect synthesis bias, correlated cross-lab pretraining, specialist tunnel-vision) are real, and a well-designed adversarial review acknowledges them rather than pretending the structural separation alone is sufficient.&lt;/p&gt;

&lt;p&gt;The architect's full system prompt is at &lt;a href="https://github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym" rel="noopener noreferrer"&gt;github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym&lt;/a&gt;. If the structural separation is the load-bearing claim, you should be able to read the prompt yourself and decide whether the constraint actually holds. I'd rather you do that than take my word.&lt;/p&gt;

&lt;h2&gt;
  
  
  heym is the multiplier
&lt;/h2&gt;

&lt;p&gt;heym is closest to n8n with first-class agent primitives. Self-hosted via Docker. Native multi-agent orchestration (&lt;code&gt;isOrchestrator: true&lt;/code&gt; and &lt;code&gt;subAgentLabels&lt;/code&gt; on the agent node), canvas node tools, native MCP client, and crucially: each heym workflow can be exposed as its own MCP server.&lt;/p&gt;

&lt;p&gt;Which means this 4-agent code review team isn't just a workflow. It's a callable primitive. Drop the MCP into Cursor, Claude Code, an autoresearch loop, a Codex Cloud job, or a custom Python pipeline. The agent finishes its work, calls the team for a code review, gets back a structured verdict, and decides what to do with it.&lt;/p&gt;

&lt;p&gt;That's the layer the field hasn't filled yet. Vertical bots like CodeRabbit do human PR review on GitHub; nobody had built the open primitive for the agent layer. So I did.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open source
&lt;/h2&gt;

&lt;p&gt;The workflow JSON, system prompts, verification tests, and a setup walkthrough are at &lt;a href="https://github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym" rel="noopener noreferrer"&gt;github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym&lt;/a&gt;. MIT.&lt;/p&gt;

&lt;p&gt;For one-click import on the heym template marketplace: &lt;a href="https://heym.run/templates/adversarial-code-review" rel="noopener noreferrer"&gt;heym.run/templates/adversarial-code-review&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A heym instance, v0.0.13+ (self-hosted Docker).&lt;/li&gt;
&lt;li&gt;An Ejentum API key (free tier 100 calls; Ki at 5,000/month for ongoing use).&lt;/li&gt;
&lt;li&gt;LLM credentials in heym for whichever model families you want each specialist running on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Import the JSON, set credentials, walk through the README. Roughly 15 minutes from clone to first working review if heym is already running; longer if you're standing up the heym Docker stack from zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  What heym is, in three sentences (for readers who haven't seen it)
&lt;/h2&gt;

&lt;p&gt;heym is "an AI-native automation platform built from the ground up around LLMs, agents, and intelligent tooling" (their own description). The closest analog is n8n with native agent primitives baked in. Self-hosted via Docker, repo at &lt;a href="https://github.com/heymrun/heym" rel="noopener noreferrer"&gt;github.com/heymrun/heym&lt;/a&gt;, shipping fast over the past month.&lt;/p&gt;

&lt;p&gt;Two heym features this workflow leans on: canvas node tools (any node on the canvas can be wired into an Agent's Tool input, with individual fields marked as agent-fillable at runtime) and native multi-agent orchestration (one agent calls named sub-agents and sub-workflows visually). Without those primitives, you'd be hand-coding orchestration; with them, the entire 4-agent setup is a canvas you can read at a glance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this is going
&lt;/h2&gt;

&lt;p&gt;This is the first team in &lt;code&gt;agent-teams/&lt;/code&gt;. The pattern (orchestrator + N specialists with cognitive harnesses) generalizes to other tasks where multi-cognitive analysis genuinely beats single-agent output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Refactor planner (reasoning + code + anti-deception)&lt;/li&gt;
&lt;li&gt;Security audit triage (anti-deception + code + reasoning)&lt;/li&gt;
&lt;li&gt;Production debug forensic (reasoning + code + memory)&lt;/li&gt;
&lt;li&gt;Strategic decision audit (reasoning + anti-deception + memory)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each follows the same structural rule: the architect has no harness, every concern is sourced from a specialist's evidence. The architecture encodes the multi-cognitive value into the workflow shape rather than leaving it to prompt theater.&lt;/p&gt;

&lt;p&gt;If you build a team using this pattern, drop a folder in &lt;code&gt;agent-teams/&lt;/code&gt; with your workflow + system prompts and I'll merge.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;Not a hosted SaaS. You run heym on your own Docker. The Ejentum harness calls go through Ejentum's API; the rest is on your infrastructure.&lt;/p&gt;

&lt;p&gt;Not a replacement for human PR review. It's a prefilter. The architect verdict gives the human a structured starting point: classification, sourced concerns, severity, falsifying tests. The human still makes the merge call.&lt;/p&gt;

&lt;p&gt;Not a benchmark of "AI code review accuracy." It's a workflow template. Run it on your own diffs; calibrate to your own taste.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcevfvok5vyg4jnrcsdfh.png" alt=" " width="800" height="325"&gt;
&lt;/h2&gt;

&lt;p&gt;Open source, MIT, repo at &lt;a href="https://github.com/ejentum/agent-teams" rel="noopener noreferrer"&gt;github.com/ejentum/agent-teams&lt;/a&gt;. One-click import: &lt;a href="https://heym.run/templates/adversarial-code-review" rel="noopener noreferrer"&gt;heym.run/templates/adversarial-code-review&lt;/a&gt;.&lt;br&gt;
ejentum.com&lt;br&gt;
 Questions: &lt;a href="mailto:info@ejentum.com"&gt;info@ejentum.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I shipped ejentum-mcp today: four cognitive harnesses as MCP tools</title>
      <dc:creator>Frank Brsrk </dc:creator>
      <pubDate>Wed, 06 May 2026 12:38:06 +0000</pubDate>
      <link>https://dev.to/frank_brsrk/i-shipped-ejentum-mcp-today-four-cognitive-harnesses-as-mcp-tools-2heb</link>
      <guid>https://dev.to/frank_brsrk/i-shipped-ejentum-mcp-today-four-cognitive-harnesses-as-mcp-tools-2heb</guid>
      <description>&lt;p&gt;Just shipped &lt;a href="https://github.com/ejentum/ejentum-mcp" rel="noopener noreferrer"&gt;ejentum-mcp&lt;/a&gt;, an MCP server that exposes the four Ejentum cognitive harnesses as MCP tools any agentic client can call. One install, works in Claude Desktop, Cursor, Windsurf, Claude Code, n8n's MCP integration, and any other MCP-compatible client.&lt;/p&gt;

&lt;p&gt;If you don't know Ejentum: it's a cognitive scaffolding API I've been building. The reasoning gap is structural, not informational. Models know plenty; they take shortcuts under pressure. The scaffold blocks the shortcuts.&lt;/p&gt;

&lt;p&gt;You send a task description, you get back a structured cognitive scaffold (failure pattern to avoid, procedure, suppression vectors, falsification test) that the calling LLM absorbs internally before responding. The point is to catch LLM failure modes that ship to production as confidently-wrong answers: sycophancy under user pressure, hallucinated citations, causal shortcuts, reasoning decay across long chains.&lt;/p&gt;

&lt;p&gt;Until today, integration meant either an HTTP request tool (in n8n or any framework that can POST), a skill file (for Claude Code's CLAUDE.md convention), or a direct Python/TypeScript call. All work, but each is bespoke.&lt;/p&gt;

&lt;p&gt;The MCP server collapses that. One install captures the four harnesses (&lt;code&gt;harness_reasoning&lt;/code&gt;, &lt;code&gt;harness_code&lt;/code&gt;, &lt;code&gt;harness_anti_deception&lt;/code&gt;, &lt;code&gt;harness_memory&lt;/code&gt;) as native tools your agent can call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;

&lt;p&gt;Easiest path is Smithery's one-click:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx &lt;span class="nt"&gt;-y&lt;/span&gt; @smithery/cli &lt;span class="nb"&gt;install &lt;/span&gt;ejentum/ejentum-mcp &lt;span class="nt"&gt;--client&lt;/span&gt; claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;claude&lt;/code&gt; with &lt;code&gt;cursor&lt;/code&gt;, &lt;code&gt;windsurf&lt;/code&gt;, &lt;code&gt;cline&lt;/code&gt;, etc. Paste your &lt;code&gt;EJENTUM_API_KEY&lt;/code&gt; when prompted. Done.&lt;/p&gt;

&lt;p&gt;Manual install (any MCP client):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ejentum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ejentum-mcp"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"EJENTUM_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your_key"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Free tier: 100 calls, no card required.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four tools
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Use for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;harness_reasoning&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Multi-step analysis, planning, diagnostics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;harness_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Code generation, refactor, review, debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;harness_anti_deception&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sycophancy pressure, hallucination risk, manipulation pressure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;harness_memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Perception sharpening, drift detection across turns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each tool takes one argument (&lt;code&gt;query&lt;/code&gt;, a 1-2 sentence task framing). Returns the harness scaffold as text. The calling LLM absorbs it internally and shapes its response with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest note on autonomous routing
&lt;/h2&gt;

&lt;p&gt;This is the part most MCP server READMEs skip. I'm putting it up front because it's the truthful UX:&lt;/p&gt;

&lt;p&gt;The tools fire reliably when you explicitly invoke them ("use the harness_anti_deception tool to evaluate..."). Soft suggestions also work ("reason about this", "check this for sycophancy", "review this code carefully").&lt;/p&gt;

&lt;p&gt;For tasks where the agent could plausibly answer well from native reasoning, autonomous calling is less reliable. This is a property of optional MCP tools in general, not specific to ejentum-mcp. Agents are tuned to minimize unnecessary tool calls. Even with a thorough description rewrite (imperative "Call BEFORE answering", concrete trigger phrases, value props, DO NOT CALL exclusions), the v0.1.1 dogfood test showed the model still didn't fire on cold prompts.&lt;/p&gt;

&lt;p&gt;For Claude Code users who want stronger autonomous routing, install the &lt;a href="https://ejentum.com/docs/skill_unified" rel="noopener noreferrer"&gt;skill files&lt;/a&gt; alongside the MCP server. The skill files give Claude system-level context about when to call each harness. They coexist with the MCP install cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why MCP for cognitive infrastructure
&lt;/h2&gt;

&lt;p&gt;The most-installed MCP server on Smithery is Sequential Thinking. It exposes one tool that wraps one cognitive operation, and developers install it in droves. That's the demand signal: developers want callable cognitive operations as tools, with low friction and zero new accounts.&lt;/p&gt;

&lt;p&gt;Ejentum has 679 engineered cognitive operations across four harnesses. The MCP server is the retail packaging that puts that library on the shelf where developers shop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Listings and source
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Smithery: &lt;a href="https://smithery.ai/servers/ejentum/ejentum-mcp" rel="noopener noreferrer"&gt;https://smithery.ai/servers/ejentum/ejentum-mcp&lt;/a&gt; (one-click install)&lt;/li&gt;
&lt;li&gt;Glama: &lt;a href="https://glama.ai/mcp/servers/ejentum/ejentum-mcp" rel="noopener noreferrer"&gt;https://glama.ai/mcp/servers/ejentum/ejentum-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;mcp.so: &lt;a href="https://mcp.so/server/ejentum-mcp/Ejentum" rel="noopener noreferrer"&gt;https://mcp.so/server/ejentum-mcp/Ejentum&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Source (MIT): &lt;a href="https://github.com/ejentum/ejentum-mcp" rel="noopener noreferrer"&gt;https://github.com/ejentum/ejentum-mcp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://ejentum.com/docs/mcp_guide" rel="noopener noreferrer"&gt;https://ejentum.com/docs/mcp_guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you build agentic systems and want to try this on your own tasks, the install takes about 30 seconds and the free tier covers exploration.&lt;/p&gt;

&lt;p&gt;Questions: &lt;a href="mailto:info@ejentum.com"&gt;info@ejentum.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claude</category>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
