
Alex Delov


Hermes Agent Needs a Flight Recorder - So I Built One

Hermes Agent Challenge Submission: Build With Hermes Agent

This is a submission for the Hermes Agent Challenge

Autonomous agents can now write code, call tools, browse the web, mutate files, and delegate to subagents. But when they fail, they fail invisibly.

"An agent ran overnight, caught an unhandled exception loop, and burned $50 in tokens while corrupting our staging database."

If you've spent more than a week building production systems with autonomous agents, you've lived some version of this nightmare.

Most agent runtimes don't crash cleanly. They slide into retry storms, silently ignore failed tool calls, or recurse through delegation loops until budgets evaporate.

Airplanes have flight recorders. Distributed systems have OpenTelemetry. Autonomous agents need TraceGuard.


What I Built

TraceGuard is a lightweight Python library and CLI that acts as an isolated, non-invasive execution flight recorder for autonomous agent runtimes.

It consumes append-only JSONL execution traces and detects the three silent killers of agentic workflows:

  • Retry Storms
  • Silent Failures
  • Recursive Delegation Cycles
traceguard traces/my_agent_run.jsonl --strict
# exit 0 = clean · exit 1 = WARN · exit 2 = CRITICAL

Instead of scraping human-readable terminal logs, TraceGuard turns runtime execution into a structured, replayable execution event contract.
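The exit-code convention in the CLI comment above (0 clean, 1 WARN, 2 CRITICAL) can be read as "report the worst severity found". A minimal sketch of that mapping, assuming severities are plain strings; `exit_code_for` and `SEVERITY_TO_EXIT` are hypothetical names, not TraceGuard's actual API, and how `--strict` alters the mapping is not specified here:

```python
# Hypothetical helper: map the worst severity in a run to a process exit code.
SEVERITY_TO_EXIT = {"WARN": 1, "CRITICAL": 2}


def exit_code_for(severities):
    """Return 0 for a clean run, otherwise the highest mapped exit code."""
    # Unknown severity labels are treated as 0 (an assumption of this sketch).
    return max((SEVERITY_TO_EXIT.get(s, 0) for s in severities), default=0)
```

A CI job can then branch on the exit code alone, without parsing any terminal output.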

GitHub: https://github.com/Ale007XD/traceguard


The Problem Nobody Talks About

Modern agent frameworks can browse the web, write files, execute shell commands, and coordinate sub-agents. But when something goes wrong, you're usually left with a giant wall of terminal output and one impossible question:

What actually happened?

Not what the LLM said. Not the final output. The actual execution state:

  • What tool calls executed?
  • Which failures were silently ignored?
  • Where did the retry loop begin?
  • Which sub-agent delegated back into itself?

Distributed systems engineers solved these problems decades ago using structured traces, append-only logs, and replayable execution histories. Agent runtimes are now complex enough to require the same discipline.


The Mental Model

Autonomous agents are stochastic distributed runtimes.

| Distributed System Failure | Agent Equivalent | Observability Primitive |
| --- | --- | --- |
| Retry storm | Same tool called repeatedly without progress | Sliding window counter over event stream |
| Silent failure | Tool fails, agent continues anyway | Error propagation trace |
| Circular dependency | Agent A delegates to B which delegates back to A | Delegation cycle detection |
| State divergence | Agent acts on corrupted or stale state | Replayable transition history |

Formally, each event E takes the runtime from state S to a next state S': δ(S, E) → S'

Agent Runtime
      │
      ▼
Append-Only Event Stream
      │
      ▼
  TraceGuard
      │
  ┌───┴───────┬──────────────┐
  ▼           ▼              ▼
Retry      Silent       Recursive
Storms    Failures     Delegation

Every execution step becomes a formal state transition. The runtime stops being an opaque, ephemeral process and becomes a replayable execution artifact.
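One way to picture the transition function is as a fold over the event stream: replaying the same events from the same initial state always yields the same final state. The sketch below is illustrative only. The state shape (`steps`, `calls`) and the `tool_result` event type are assumptions made for this example; only `tool_call` and `tool_name` appear in the event format described in this post.

```python
from functools import reduce


def delta(state, event):
    """δ(S, E) → S': build a new state from the old one; never mutate it."""
    calls = dict(state["calls"])  # copy so the previous state stays intact
    if event.get("type") == "tool_call":
        name = event["tool_name"]
        calls[name] = calls.get(name, 0) + 1
    return {"steps": state["steps"] + 1, "calls": calls}


def replay(events):
    """Fold the whole trace into a final state, starting from an empty one."""
    initial = {"steps": 0, "calls": {}}
    return reduce(delta, events, initial)
```

Because `delta` never mutates its input, every intermediate state can be kept or recomputed, which is what makes offline replay possible.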


The Missing Primitive

Hermes Agent currently exposes beautiful terminal output optimized for humans. Production observability requires something fundamentally different: machine-readable execution semantics.

Example event:

{
  "event_id": "3f8a1c2d-...",
  "session_id": "hermes-session-001",
  "timestamp": "2026-05-29T10:00:00.050Z",
  "schema_version": "1.0",
  "type": "tool_call",
  "tool_name": "bash",
  "tool_args": {
    "command": "git status --porcelain"
  }
}

Each event is:

  • Immutable — append-only after creation
  • Self-describing — schema versioned and typed
  • Replayable — execution can be reconstructed offline
  • Composable — detectors operate over the same event stream

The missing primitive is not another dashboard. It is a structured execution event stream.
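To make "immutable" and "self-describing" concrete, here is a rough stand-in for such an event. TraceGuard itself uses Pydantic v2 models with `frozen=True`; this sketch swaps in a standard-library frozen dataclass so it runs without extra installs. The class name `Event` and the helper `to_jsonl_line` are hypothetical; the field names follow the example event above.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # attributes cannot be reassigned after creation
class Event:
    session_id: str
    type: str
    tool_name: str
    tool_args: dict  # note: freezing is shallow, the dict itself is still mutable
    schema_version: str = "1.0"
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_jsonl_line(event):
    """Serialize one event as a single JSON line (no trailing newline)."""
    return json.dumps(asdict(event), sort_keys=True)
```

Attempting `event.type = "other"` raises `dataclasses.FrozenInstanceError`, which is the property the detectors rely on: an event that has been recorded cannot be quietly rewritten.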


Three Detectors. One Governance Layer.

Retry Storm Detector

Detects identical tool invocations repeating without successful progress.

Example: bash → fail, bash → fail, bash → fail (retry storm)
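A sliding-window counter for this case might look like the sketch below. It is not TraceGuard's detector: the `tool_result` event type, its `status` field, and the `threshold` and `window` parameters are assumptions made for illustration.

```python
import json
from collections import deque


def find_retry_storms(events, threshold=3, window=10):
    """Yield (index, tool_name) when an identical failing call reaches
    `threshold` occurrences among the last `window` failed results."""
    recent = deque(maxlen=window)  # signatures of recent failed results
    for i, ev in enumerate(events):
        if ev.get("type") != "tool_result":  # hypothetical event type
            continue
        sig = (ev["tool_name"], json.dumps(ev.get("tool_args", {}), sort_keys=True))
        if ev.get("status") == "ok":
            # Progress was made: forget earlier failures of this exact call.
            recent = deque((s for s in recent if s != sig), maxlen=window)
            continue
        recent.append(sig)
        if recent.count(sig) == threshold:
            yield i, ev["tool_name"]
```

Keying on the tool name plus its serialized arguments means two different `bash` commands that both fail are not counted as one storm.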

Silent Failure Detector

Detects agents continuing execution after failed or empty tool outputs.

Example: read_file → empty → continue execution (silent corruption)
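The check can be approximated as "a failed or empty result that is followed by anything other than explicit handling". The sketch below assumes a hypothetical `tool_result` event with `status` and `output` fields, and a hypothetical set of event types that count as handling; none of these names come from TraceGuard itself.

```python
# Hypothetical event types that count as "the failure was dealt with".
HANDLING_TYPES = {"error_handled", "abort"}


def find_silent_failures(events):
    """Return indices of failed or empty tool results that were followed
    by further execution instead of an explicit handling event."""
    flagged = []
    # The last event has no successor, so it can never be "continued past".
    for i, ev in enumerate(events[:-1]):
        if ev.get("type") != "tool_result":
            continue
        # A missing `output` field is treated as empty in this sketch.
        failed = ev.get("status") == "error" or ev.get("output") in ("", None)
        if failed and events[i + 1].get("type") not in HANDLING_TYPES:
            flagged.append(i)
    return flagged
```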

Recursive Delegation Detector

Detects sub-agent delegation cycles and self-recursion.

Example: planner → coder, coder → planner (recursive loop)
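Cycle detection over delegations is a standard graph problem: before adding an edge A → B, check whether B can already reach A. A minimal sketch, assuming delegations have been extracted from the trace as `(from_agent, to_agent)` pairs; the function names and that pair format are hypothetical, not TraceGuard's API.

```python
def find_delegation_cycle(edges):
    """Return the first delegation cycle as a list of agent names, or None."""
    graph = {}
    for src, dst in edges:
        path = _path(graph, dst, src)  # can dst already reach src?
        if path is not None:
            return [src] + path  # src -> dst -> ... -> src
        graph.setdefault(src, set()).add(dst)
    return None


def _path(graph, start, goal):
    """Iterative depth-first search; return a path from start to goal or None."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in graph.get(node, ()):
            stack.append((nxt, path + [nxt]))
    return None
```

Self-recursion falls out of the same check: an agent delegating to itself is a cycle of length one.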

Each detector operates independently over the same append-only event stream. Multiple detectors can fire simultaneously on the same execution trace.


Execution Governance

TraceGuard is intentionally designed as an external execution observer.

  • No monkey-patching
  • No framework lock-in
  • No invasive runtime hooks
  • No dependency on Hermes internals
LLM proposes
      │
      ▼
Runtime executes
      │
      ▼
TraceGuard observes
      │
      ▼
Governance layer enforces invariants

This is the critical distinction. Prompt engineering cannot reliably solve retry storms, hidden execution corruption, or delegation cycles. Prompt-layer control is insufficient. Execution-layer governance is required.


Architecture

  • TraceEvent (schema.py) — Immutable Pydantic v2 execution events
  • TraceRecorder (recorder.py) — Append-only JSONL persistence
  • Detectors (detectors.py) — Streaming anomaly detectors
  • TraceGuard (guard.py) — Batch + real-time governance pipeline

The core invariant is simple: Record every transition. Analyze the record.

Once execution becomes replayable, agent runtimes stop behaving like black boxes.


How This Connects to Hermes

Hermes Agent currently produces terminal output optimized for human inspection. TraceGuard proposes a complementary execution event contract — a machine-readable stream of typed, versioned, append-only events emitted alongside the human-readable output.

This aligns with the discussion in issue #169 on structured execution semantics.

The integration path is additive: TraceGuard requires no changes to Hermes internals. Emit events to a JSONL file; TraceGuard reads them externally.
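On the emitting side, "append-only JSONL" needs very little code. The sketch below uses only the standard library; `append_event` and `read_events` are hypothetical helper names for illustration, not part of TraceGuard or Hermes.

```python
import json
from pathlib import Path


def append_event(path, event):
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")


def read_events(path):
    """Load every non-blank line of a JSONL trace as a dict, in file order."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening the file in `"a"` mode is the whole trick: the writer can only add lines at the end, so a reader running in another process always sees a prefix of the final trace.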


Demo

$ traceguard traces/retry_storm.jsonl
[WARN] RetryStormDetector: tool 'bash' called 4 times without success (threshold=3)
[WARN] SilentFailureDetector: step 2 failed, execution continued without error handling
[WARN] SilentFailureDetector: step 4 failed, execution continued without error handling
[WARN] SilentFailureDetector: step 6 failed, execution continued without error handling
[WARN] SilentFailureDetector: step 7 failed, execution continued without error handling

$ traceguard traces/recursive_delegation.jsonl
[CRITICAL] RecursiveDelegationDetector: delegation cycle detected — planner → coder → planner

$ traceguard traces/clean.jsonl
✓ No anomalies detected.

$ traceguard traces/retry_storm.jsonl --strict; echo "exit: $?"
exit: 1

Code

from traceguard import TraceGuard

guard = TraceGuard()
report = guard.analyze("traces/my_agent_run.jsonl")

for anomaly in report.anomalies:
    print(f"[{anomaly.severity}] {anomaly.detector}: {anomaly.message}")

if report.is_clean:
    print("✓ No anomalies detected.")

My Tech Stack

  • Python 3.10+ — minimum target, tested on 3.14
  • Pydantic v2 — immutable frozen=True event models
  • Typer + Rich — CLI with structured terminal output
  • JSONL — append-only trace persistence format
  • pytest — 13/13 tests passing
  • hatchling — packaging

No runtime dependencies beyond the libraries listed above. No framework lock-in.


How I Used Hermes

TraceGuard was developed and iterated with Hermes Agent as the primary development environment — reading files, applying patches, running tests, and diagnosing failures through FSM-structured execution loops.

The irony is deliberate: a tool for governing agent execution traces was built by an agent whose execution was governed by the same FSM principles.

Hermes drove: reading source files → generating S&R patches → applying changes → running pytest → diagnosing failures → iterating.


Why This Matters

Most failures in autonomous systems are not model failures. They are execution failures:

  • Infinite retries
  • Ignored exceptions
  • Corrupted state propagation
  • Delegation recursion
  • Unbounded token burn

The model is usually doing exactly what it was asked to do. The runtime simply lacks governance.

"LLMs propose. Runtimes govern."


What Comes Next

  • Replay Engine — Re-execute traces against patched tool implementations
  • Behavioral Regression Testing — Compare execution traces across models and versions
  • OpenTelemetry Export — Emit OTLP spans for Grafana, Datadog, and distributed tracing platforms

TraceGuard aims to be for autonomous agents what OpenTelemetry became for distributed systems.


Built for the Hermes Agent Challenge 2026.

Repository: https://github.com/Ale007XD/traceguard

Built on llm-nano-vm — deterministic FSM execution infrastructure.
