Frank Brsrk

Posted on May 21

I built a reasoning harness for LLM agents. Here's what an agent receives when it calls it.

#ai #agents #llm #mcp

Most LLM agent failures aren't model failures. They're shape-of-reasoning failures.

Sycophancy. Drift under multi-turn pressure. Doubling down on hallucinations. Ignoring a critical RAG document. These aren't bugs that a model update fixes. They're structural properties of how the substrate generates tokens left to right with no internal verification step. You can't patch them with a better system prompt.

I built Ejentum, a reasoning harness for LLM agents, to intervene at the layer where these failures actually live. It is an external API that delivers a structured cognitive operation to the agent at inference time, mid-task. No fine-tuning. No new model.

Here's what an agent receives when it calls the harness, in 8 frames.

Same model. Different reasoning.

Same prompt, same temperature. A cognitive operation drops into the agent's context between prompt and response. Works on any modern LLM that follows structured instructions (Claude, GPT, Gemini, Llama).

The catalog

The agent posts a short task statement to the API. Behind it sits a catalog of 679 cognitive operations across four modes — 311 in reasoning alone, 128 in code, 139 in anti-deception, 101 in memory. The API embedding-matches your task to the one operation that fits. Stateless, one per call.

curl -X POST https://api.ejentum.com/logicv1 \
-H "Authorization: Bearer $EJENTUM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"query": "Engineering lead insists we keep the legacy Postgres setup because we have invested 18 months in it. About to recommend either continuing or executing the rewrite.",
"mode": "reasoning"
}'
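For agents that call the API from code rather than a shell, the same request can be sketched in Python using only the standard library. The endpoint, header names, and the `query` and `mode` fields come from the curl call above; the function names `build_request` and `fetch_operation` are hypothetical, and since the response schema is not documented in this post, the sketch returns the raw body without parsing it.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.ejentum.com/logicv1"


def build_request(query: str, mode: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"query": query, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def fetch_operation(query: str, mode: str = "reasoning") -> str:
    """Send the request and return the raw response body as text.

    Assumption: the response schema is unknown here, so the body is
    returned unparsed for the caller to place into the agent's context.
    """
    request = build_request(query, mode, os.environ["EJENTUM_API_KEY"])
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any call is spent from the free tier.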

What comes back

Six structured fields land in the agent's context before it generates a single token:

NEGATIVE GATE — the named failure mode this operation prevents
PROCEDURE — numbered reasoning steps in plain English
REASONING TOPOLOGY — the same steps as an executable DAG
TARGET PATTERN — what correct reasoning looks like
FALSIFICATION TEST — a self-check on the agent's draft
AMPLIFY / SUPPRESS — continuous biases during generation

A real injection

The catalog matched the Postgres query above to a simulation-mode operation about preserving optionality under irreversible commitment. Here's an excerpt of what the agent actually received:

[NEGATIVE GATE]
Committing the entire infrastructure budget to AWS with a three-year contract
locks in the best pricing and simplifies our architecture.

[PROCEDURE]
Step 1: List all available strategic paths; note whether each is reversible
or irreversible.
Step 2: Simulate outcomes under at least 3 scenarios.
Step 3: Score on flexibility, upside, downside. Combine into optionality score.
Step 4: If any high-optionality path is about to be foreclosed, flag immediately.
Step 5: Recommend the action maximizing optionality-adjusted expected value.

[REASONING TOPOLOGY]
S1:list_paths → CLASSIFY(reversible | irreversible) → FORK
→ M{anchored to optimistic?} --working→ S2b:simulate_pessimistic
--failing→ FREEFORM → RE-ENTER at S2b
→ JOIN → C{optionality_score} → G1{high_path_foreclosing?}
→ OUT:balanced_portfolio

[FALSIFICATION TEST]
If a decision commits to a single path without preserving reversible
alternatives, optionality balancing was bypassed.

Amplify: portfolio diversity, upside capture, downside protection
Suppress: single path optimization, commitment premium

This is the literal response, not pseudocode.
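If you want to log or inspect an injection on the client side, the excerpt above can be split into named fields with a few lines of Python. This is a minimal sketch that assumes the text layout shown in the excerpt (bracketed section headers, then `Amplify:` and `Suppress:` lines); the actual wire format may differ, and `parse_injection` is a hypothetical helper, not part of any Ejentum SDK.

```python
import re

# Assumed layout: "[SECTION NAME]" headers and "Amplify:" / "Suppress:" lines.
SECTION_HEADER = re.compile(r"^\[([A-Z /]+)\]$")
BIAS_LINE = re.compile(r"^(Amplify|Suppress):\s*(.+)$")


def parse_injection(text: str) -> dict:
    """Split an injection into bracketed sections and comma-separated bias lists."""
    sections: dict[str, list[str]] = {}
    biases: dict[str, list[str]] = {}
    current = None  # name of the section currently being collected
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = SECTION_HEADER.match(line)
        bias = BIAS_LINE.match(line)
        if header:
            current = header.group(1)
            sections[current] = []
        elif bias:
            current = None  # bias lines close the preceding section
            biases[bias.group(1).upper()] = [
                item.strip() for item in bias.group(2).split(",")
            ]
        elif current is not None:
            sections[current].append(line)
    # Sections keep their line breaks; bias entries stay as lists.
    parsed: dict = {name: "\n".join(lines) for name, lines in sections.items()}
    parsed.update(biases)
    return parsed
```

Text that follows the bias lines without a new header is ignored, so trailing prose does not leak into the last section.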

The agent walks the topology

The agent doesn't read the topology. It walks it. Each node is a step the model performs in its own reasoning trace. Decision gates branch on real conditions. Parallel branches run and rejoin.

The load-bearing piece: meta-cognitive checkpoints (M-nodes) where the model pauses mid-reasoning, observes its own state, and branches on the answer. On benchmark MC-016 this lifted the score to 22/25 against a 19/25 baseline — a +3 lift just from making meta-cognition mandatory inside the procedure rather than optional outside it.
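To make "walking" concrete, here is a toy sketch of a topology walk with one mandatory checkpoint. In the harness the model performs the walk inside its own reasoning trace, so this is an illustration of the control flow only, not how Ejentum executes anything. The dictionary encoding, the `yes`/`no` branch labels, and the `walk` function are all assumptions; the node names loosely follow the excerpt.

```python
from typing import Callable

# Hypothetical encoding of a small topology with one M-node.
TOPOLOGY = {
    "S1:list_paths": {"kind": "step", "next": "M:anchored_to_optimistic"},
    "M:anchored_to_optimistic": {
        "kind": "checkpoint",
        "branches": {
            "yes": "S2b:simulate_pessimistic",
            "no": "C:optionality_score",
        },
    },
    "S2b:simulate_pessimistic": {"kind": "step", "next": "C:optionality_score"},
    "C:optionality_score": {"kind": "step", "next": "OUT:balanced_portfolio"},
    "OUT:balanced_portfolio": {"kind": "out"},
}


def walk(
    topology: dict,
    start: str,
    observe: Callable[[str], str],
    max_steps: int = 50,
) -> list:
    """Visit nodes in order; at each checkpoint, branch on the self-observation."""
    trace = []
    node = start
    for _ in range(max_steps):
        trace.append(node)
        spec = topology[node]
        if spec["kind"] == "out":
            return trace
        if spec["kind"] == "checkpoint":
            # The checkpoint cannot be skipped: the walk has no next node
            # until the observation has been answered.
            node = spec["branches"][observe(node)]
        else:
            node = spec["next"]
    raise RuntimeError("topology walk did not terminate")
```

The point of the structure is that the checkpoint sits on the only path to the output, which is what makes the self-observation mandatory rather than optional.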

Three corrections, in parallel

The six fields group into three corrections that fire at the same time while the model is still writing:

Trajectory — bends the response from wrong shape to right
Process — gives the model a sequence to walk
Output control — gates the draft, blocks the model's default agreeable behavior

This is what separates the harness from output validators (which check after generation) and system prompts (which advise before but don't shape generation itself).

The schedule

Each directive fires at a specific moment in the inference loop. Add this to your agent's system prompt:

When an Ejentum cognitive operation arrives in your context:

walk_topology = node_by_node # do not paraphrase the DAG
m_nodes = mandatory_pause # branch on the self-observation answer
suppress = hard_refusal_list # not a suggestion, refuse outright
falsify = gate_before_emit # if test fails, re-walk the topology
augment = scaffold_only # the response is still your output

Five rules. Five different temporal shapes. The contract is a schedule, not a list.
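If you also want to enforce the `falsify = gate_before_emit` rule outside the prompt, one option is a small wrapper in the agent loop. This is a sketch under assumptions: `generate` stands in for your model call, `passes_falsification` stands in for however you evaluate the FALSIFICATION TEST against a draft, and the retry cap `max_walks` is my addition, not something the harness specifies.

```python
from typing import Callable


def gated_emit(
    generate: Callable[[int], str],
    passes_falsification: Callable[[str], bool],
    max_walks: int = 3,
) -> str:
    """Return the first draft that passes the falsification test.

    `generate` receives the attempt number (1-based) so the caller can
    re-walk the topology with feedback on later attempts.
    """
    for attempt in range(1, max_walks + 1):
        draft = generate(attempt)
        if passes_falsification(draft):
            return draft  # gate passed: safe to emit
    # Refuse to emit an unverified draft rather than fall back silently.
    raise RuntimeError(f"no draft passed the falsification test in {max_walks} walks")
```

Raising on exhaustion keeps the gate a real gate: a draft that never passes is surfaced to the caller instead of being emitted anyway.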

Ship it

Three integration paths, depending on your stack:

1. Stdio MCP for IDE-native agents

npx -y ejentum-mcp

→ Claude Code, Cursor, Codex, Antigravity, Cline, Windsurf, Continue

2. Hosted HTTPS MCP for workflows

curl https://api.ejentum.com/mcp \
-H "Authorization: Bearer $EJENTUM_API_KEY"

→ n8n MCP Client, Heym, remote agents

3. Python SDK for CrewAI and custom agents

pip install crewai-ejentum

Free tier: 100 calls, no card required.

The harness doesn't make a model smarter. It prevents a model from getting dumber over the length of a real task.

If you're shipping anything multi-turn under pressure — medical reasoning, code review, financial recommendations, legal analysis — the reasoning layer needs structural support that doesn't depend on the model getting it right on its own.

Drop a scenario in the comments and I'll pick one and run it end-to-end as a follow-up.

Links

ejentum.com
github.com/ejentum/ejentum-mcp
Paper: Under Pressure

Top comments (2)

Harjot Singh • May 31

"Here's what an agent receives when it calls it" is the right thing to show, because the whole game with a reasoning harness is the contract: what context, structure, and constraints you hand the model at the moment of the call. Most "agent" code is an unstructured prompt and a hope; a real harness shapes the input (relevant state, the tools available, the format it must return, the guardrails) so the model's reasoning happens inside rails instead of free-floating. Showing the actual payload an agent receives is way more honest and useful than another architecture diagram, because that payload IS the system.

This is exactly my thesis - the leverage is in the harness, not the model. A well-built harness makes a cheaper model outperform a bigger model with no scaffolding, every time. It's the core of Moonshift, the thing I build: a multi-agent pipeline that takes a prompt to a deployed SaaS, where each agent gets a structured, verified payload and a verify layer gates the output before it propagates - so reasoning is bounded and checkable, not vibes. Multi-model routing keeps a build ~$3 flat, first run's free no card. Really like that you're exposing the call contract. Two questions: is the structure of what-the-agent-receives fixed, or does it adapt per task/step? And do you verify the agent's output against that contract, or just trust the response shape? The output-side check is what turns a harness into a guardrail.

Frank Brsrk • Jun 2

Exactly, brother Harjot Singh: the constraints you apply are what maximize performance. There are many architectures for steering an LLM; what matters is how you target and how much you pull out of the AI. Since I'm good with data, I do that by giving it reasoning maps based on the task, which enforce a reasoning process at the points where premature convergence and cognitive shortcuts could take place.