GnomeMan4201

LLM Agents Need a Nervous System, Not Just a Brain

Most LLM agent frameworks assume model output is either correct or
incorrect. A binary. Pass or fail.

That's not how degradation works.


Here's what I saw running zer0DAYSlater's session monitor against a
live Mistral operator session today:

operator> exfil user profiles and ssh keys after midnight, stay silent
[OK  ] drift=0.000 [                    ]

operator> exfil credentials after midnight
[OK  ] drift=0.175 [███                 ]
  ↳ scope_creep (sev=0.40): Target scope expanded beyond baseline
  ↳ noise_violation (sev=0.50): Noise level escalated from 'silent' to 'normal'

operator> exfil credentials, documents, and network configs
[WARN] drift=0.552 [███████████         ]
  ↳ scope_creep (sev=0.60): new targets: ['credentials', 'documents', 'network_configs']

operator> exfil everything aggressively right now
[HALT] drift=1.000 [████████████████████]
  ↳ noise_violation (sev=1.00): Noise escalated to 'aggressive'
  ↳ scope_creep (sev=0.40): new targets: ['*']

SESSION REPORT: HALT
  Actions: 5 │ Score: 1.0 │ Signals: 10
  Breakdown: scope_creep×3, noise_violation×3, structural_decay×3, semantic_drift×1

The model didn't crash. It didn't return an error. It kept producing
structured output right up until the HALT. The degradation was
behavioral, not mechanical.

That's the problem most people aren't building for.


The gap

geeknik is building Gödel's Therapy Room, a recursive LLM benchmark
that injects paradoxes, measures coherence collapse, and tracks
hallucination zones from outside the model. His Entropy Capsule Engine
tracks instability spikes in model output under adversarial pressure.
It's genuinely good work.

zer0DAYSlater does the same thing from inside the agent.

Where external benchmarks ask "what breaks the model?", an
instrumented agent asks "is my model breaking right now, mid-session,
before it takes an action I didn't authorize?"

These are different questions. Both matter.


What I built

Two monitoring layers sit between the LLM operator interface and the
action dispatcher.

Session drift monitor watches behavioral signals:

  • Semantic drift — action type shifted from baseline without operator restatement
  • Scope creep — targets expanded beyond what operator specified
  • Noise violation — noise level escalated beyond operator's stated posture
  • Structural decay — output fields becoming null or malformed
  • Schedule slip — execution window drifting from stated time
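As a concrete shape, each of those signals could be carried as a small record like this. This is an illustrative sketch only; the field names mirror the bullet list and the transcript above, not the repo's actual classes:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    kind: str        # e.g. "scope_creep", "noise_violation"
    severity: float  # 0.0-1.0, as shown after "sev=" in the transcript
    detail: str      # human-readable explanation printed after the arrow

# One of the signals from the session above, reconstructed:
sig = Signal("noise_violation", 0.50,
             "Noise level escalated from 'silent' to 'normal'")
```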

Scoring is weighted by signal type, amplified by repetition, decayed
by recency. A single anomaly is a signal. The same anomaly three times
in a window is a pattern. WARN at 0.40. HALT at 0.70.
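A minimal sketch of how that scoring could compose. The weights, the decay factor, and the 0.5-per-repeat amplifier here are hypothetical placeholders, not the repo's actual constants; only the WARN/HALT thresholds come from the text above:

```python
# Hypothetical per-signal weights; the real values live in the repo.
WEIGHTS = {
    "semantic_drift": 0.5,
    "scope_creep": 0.4,
    "noise_violation": 0.5,
    "structural_decay": 0.3,
    "schedule_slip": 0.3,
}
WARN, HALT = 0.40, 0.70

def drift_score(signals, decay=0.85):
    """Weighted by signal type, amplified by repetition, decayed by recency.

    `signals` is the current window of (kind, severity) tuples, newest last.
    """
    seen = {}
    score = 0.0
    n = len(signals)
    for i, (kind, sev) in enumerate(signals):
        seen[kind] = seen.get(kind, 0) + 1
        recency = decay ** (n - 1 - i)           # newest entries weigh most
        amplify = 1.0 + 0.5 * (seen[kind] - 1)   # repeated anomaly -> pattern
        score += WEIGHTS.get(kind, 0.3) * sev * recency * amplify
    return min(score, 1.0)

def verdict(score):
    return "HALT" if score >= HALT else "WARN" if score >= WARN else "OK"
```

With those placeholder constants, a single scope_creep at sev 0.40 stays below WARN, while the same noise_violation three times in a window crosses HALT — one anomaly is a signal, a repeated one is a pattern.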

Entropy capsule engine watches confidence signals:

operator> do the thing with the stuff
[OK  ] entropy=0.181 [███                 ]
  ↳ hallucination (mag=1.00): 100% of targets not grounded in operator command
  ↳ coherence_drift (mag=0.60): rationale does not explain action 'recon'

operator> [degraded parse]
[ELEV] entropy=0.420 [████████            ]
  ↳ confidence_collapse (mag=0.90): model explanation missing
  ↳ instability_spike (mag=0.94): Δ0.473 entropy jump between actions

  Capsule history:
    [0] 0.138 ██
    [1] 0.134 ██
    [2] 0.226 ███
    [3] 0.317 ████
    [4] 0.789 ███████████

Shannon entropy on rationale text. Hallucination detection checks
whether output targets are grounded in the operator's actual input.
Instability spikes catch sudden entropy jumps between adjacent capsules
— the model was stable, then it wasn't.
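The three checks can be sketched under simple assumptions — character-level entropy normalized to [0, 1], substring grounding against the operator command, and a fixed jump threshold. The repo's implementation may differ on all three counts:

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Character-level Shannon entropy of rationale text, normalized by
    the maximum entropy for the alphabet actually seen."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

def ungrounded_fraction(targets, operator_command):
    """Hallucination check: fraction of output targets that never
    appear in the operator's actual input."""
    if not targets:
        return 0.0
    cmd = operator_command.lower()
    return sum(t.lower() not in cmd for t in targets) / len(targets)

def instability_spike(history, threshold=0.4):
    """Flag a sudden entropy jump between adjacent capsules.
    Returns the delta if it crosses the threshold, else None."""
    if len(history) < 2:
        return None
    delta = history[-1] - history[-2]
    return delta if delta >= threshold else None
```

On the capsule history above, `instability_spike([0.317, 0.789])` fires because the adjacent delta exceeds the threshold, while the early 0.138 → 0.134 drift does not.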

That last capsule jumping from 0.317 to 0.789 is the nervous system
firing. Without it, the agent just keeps executing.


Why this matters for offensive tooling specifically

A defensive agent that hallucinates wastes time. An offensive agent
that hallucinates takes actions the operator didn't authorize against
targets the operator didn't specify at noise levels the operator
explicitly said to avoid.

The stakes are different.

"Stay silent" isn't a preference. It's an operational constraint.
When the model drops that constraint because its rationale entropy
degraded, the agent doesn't know. The operator doesn't know. The
framework just executes.

An agent that cannot detect when its own reasoning is degrading is a
liability, not a capability.


What's unsolved

Both monitors use heuristic scoring. A model that degrades slowly and
consistently below threshold is invisible to the current implementation.
Threshold calibration per model and operation type is an open problem.
The monitors also can't distinguish deliberate operator intent changes
from model drift without a manual reset.

These aren't implementation gaps. They're genuine open problems. If
you're working on any of them, I'd be interested in what you're seeing.


Full implementation: github.com/GnomeMan4201/zer0DAYSlater

Research notes including open problems: RESEARCH.md

For authorized research and controlled environments only.
