DEV Community

 HJS Foundation

I Logged Every Decision My AI Agent Made for a Week. Here's What I Learned.

10,847 decision events. 4 surprising insights. And one $23 wake-up call that changed how I think about agent observability.


The $23 Mystery

I run a multi-agent system that does market research. Three agents, one goal:

  • Scout: Gathers data from APIs and web sources
  • Analyst: Processes raw data into insights
  • Writer: Produces the final report

It worked fine. Until it didn't.

One Monday morning, I found a report that took 48 hours to arrive and cost $23 in API credits. Normal runs take 2 hours and cost around $4.

What happened?

I checked everything. API rate limits? No. Model downtime? No. LangSmith traces showed the chain completed successfully. Each agent reported "task done." Every log line was green.

But somewhere between "task done" and "report ready," 46 hours vanished.

That's when I realized: I had no idea what my agents were actually deciding to do. I only knew what they did.

So I ran an experiment.


The Experiment: 50 Lines of Code, One Week, Every Decision

I added a lightweight decision logger to my agent orchestrator. Not tracing API calls—we already have that. I wanted to log decisions:

  • J (Judge): An agent initiates a new task or makes a determination
  • D (Delegate): An agent hands off work to another agent
  • T (Terminate): An agent ends a task, successfully or not
  • V (Verify): An agent validates someone else's output

Here's the core code (simplified—full version on GitHub):

import asyncio
import hashlib
import json
import time
from typing import Optional
from uuid import uuid4

LOG_PATH = "decisions.ndjson"

def hash_content(content: str) -> str:
    """Create a content-addressable hash for the decision payload."""
    return f"sha256:{hashlib.sha256(content.encode()).hexdigest()[:16]}"

def _append_line(line: str) -> None:
    with open(LOG_PATH, "a") as f:
        f.write(line + "\n")

async def write_to_ndjson(event: dict) -> None:
    """Append one JSON line per event; file I/O runs in a thread."""
    await asyncio.to_thread(_append_line, json.dumps(event))

async def log_decision(verb: str, who: str, what: str, ref: Optional[str] = None):
    event = {
        "jep": "1",
        "verb": verb,                                  # J, D, T, or V
        "who": who,                                    # which agent decided
        "what": hash_content(what) if what else None,  # hash, not the payload itself
        "when": int(time.time()),
        "nonce": str(uuid4()),                         # unique event id
        "ref": ref,                                    # links to a parent event
        "aud": "research-pipeline-v1"
    }
    # Async write—doesn't block the agent
    await write_to_ndjson(event)

I deployed it on a Tuesday. One week later, I had 10,847 decision events.

Here's what I found.


Discovery #1: 35% of Delegations Were Circular

My agents delegate work to each other constantly. Scout hands raw data to Analyst. Analyst hands insights to Writer. Writer asks Scout for clarification. Normal.

But when I graphed the D (Delegate) events by ref chain, I saw something unexpected:

Scout → Analyst → Scout → Analyst → Scout (terminates)

1,203 times in one week, agents created delegation loops of length ≥ 2. Each loop burned:

  • ~2 seconds of compute time
  • One extra LLM call for the handoff reasoning
  • Token costs for the delegation message itself

Total waste: ~40 minutes of compute and $3.20 in API costs. Not catastrophic. But completely invisible until I logged the D events with their ref chains.
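Finding these loops in the raw NDJSON is mechanical once the ref chain exists. Here's a minimal sketch (field names match the logger above; the helper name is mine) that walks each D event back through its refs and flags any chain where the same agent appears twice:

```python
def find_delegation_loops(events):
    """Return nonces of D (Delegate) events whose ref chain revisits an agent.

    Assumes events are dicts with the same fields as the logger above:
    'verb', 'who', 'nonce', and 'ref' (the parent event's nonce, or None).
    """
    by_nonce = {e["nonce"]: e for e in events if e["verb"] == "D"}
    loops = []
    for e in by_nonce.values():
        seen, cur = set(), e
        while cur is not None:
            if cur["who"] in seen:          # same agent twice in one chain
                loops.append(e["nonce"])
                break
            seen.add(cur["who"])
            cur = by_nonce.get(cur["ref"])  # follow the ref link upward
    return loops
```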

The fix: I added a simple rule—if an agent receives a delegation from someone it already delegated to in the current chain, break and escalate. Loops dropped to near zero.
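The guard itself is only a few lines. A hypothetical version, assuming the orchestrator keeps an ordered list of agent names for the current delegation chain:

```python
def should_break_delegation(chain, target):
    """True if handing work to `target` would revisit an agent already in
    the chain. `chain` is per-task bookkeeping the orchestrator keeps."""
    return target in chain

def delegate(chain, target):
    # Break and escalate instead of feeding the loop another handoff.
    if should_break_delegation(chain, target):
        raise RuntimeError("delegation loop: " + " -> ".join(chain + [target]))
    return chain + [target]
```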


Discovery #2: Failed Tool Calls Retried 7 Times Before Giving Up

One of Scout's jobs is scraping competitor pricing from public websites. Occasionally, a site times out. Normal.

What wasn't normal: the retry behavior.

When a tool call failed, my agent retried—on average—7 times before terminating. The worst offender was that scraping tool. One timeout at 11:23 PM turned into:

11:23 PM - Tool call fails (timeout)
11:24 PM - Retry 1 fails
11:26 PM - Retry 2 fails
11:29 PM - Retry 3 fails
...
03:17 AM - Retry 11 fails, agent finally terminates

Four hours. Eleven retries. Each one a fresh API call with a new browser instance. Cost of that single failure chain: $1.87.

Across the week, excessive retries wasted ~$9.40.

The fix: I capped retries at 3 for non-critical tools. If it still fails, the agent logs a T with reason: "tool_unavailable" and moves on with partial data. The report might be slightly less complete, but it arrives on time and under budget.


Discovery #3: The 3 AM Termination Storm

At 3:14 AM on Wednesday, I saw something strange in the logs:

47 T (Terminate) events within 90 seconds.

Normal rate is ~10 per hour.

Every single one had reason: "empty_response". Turns out, a data provider's API had a brief outage, returning 200 OK with an empty body. Every parallel agent hit it simultaneously, received nothing, and terminated immediately.

No alert fired. From the orchestrator's perspective, all tasks "completed successfully"—they just completed with zero data. The final report that morning was 40% shorter than usual, and I had no idea why until I dug through the decision logs.

The fix: I added a simple monitor—if T events with reason: "empty_response" exceed 5 per minute, pause the pipeline and alert. The next time that API flaked, I knew within 60 seconds.
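A sketch of that monitor, with thresholds mirroring the rule above (the class name and alert hook are mine; wire the return value to whatever pauses your pipeline):

```python
import time
from collections import deque

class TerminationStormMonitor:
    """Flag when too many T events share a reason within a sliding window."""

    def __init__(self, threshold=5, window=60.0):
        self.threshold = threshold
        self.window = window
        self.events = {}  # reason -> deque of timestamps

    def record(self, reason, now=None):
        """Record one T event; return True when the storm threshold trips."""
        now = time.time() if now is None else now
        q = self.events.setdefault(reason, deque())
        q.append(now)
        while q and now - q[0] > self.window:  # drop events outside the window
            q.popleft()
        return len(q) > self.threshold         # True => pause pipeline, notify
```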


Discovery #4: Verification Was Silently Slowing Down

The V (Verify) event happens when one agent checks another's output. Analyst produces insights; Writer verifies they're coherent before including them.

I noticed something in the timestamps:

| Day | Avg time between J and V |
| --- | --- |
| Tuesday | 1.2 seconds |
| Wednesday | 1.8 seconds |
| Thursday | 2.9 seconds |
| Friday | 4.1 seconds |
| Monday | 4.7 seconds |

The verification service was drifting. Not enough to break anything yet, but a clear trend. Turned out the vector database used for fact-checking had accumulated 6 months of stale embeddings and queries were getting slower.

Without decision-level timestamps, I would have found out only after it started timing out and breaking the pipeline. Instead, I scheduled a re-indexing job over the weekend. Tuesday morning: back to 1.3 seconds.
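That trend is exactly what a one-off script over the NDJSON file can surface. A sketch, assuming each V event's ref points at the nonce of the J it verifies (as in the logger above):

```python
from collections import defaultdict
from datetime import datetime, timezone

def verify_latency_by_day(events):
    """Average seconds between each J event and the V event that refs it,
    bucketed by UTC day. Events are dicts shaped like the logger's output."""
    j_by_nonce = {e["nonce"]: e for e in events if e["verb"] == "J"}
    buckets = defaultdict(list)
    for v in events:
        if v["verb"] == "V" and v.get("ref") in j_by_nonce:
            j = j_by_nonce[v["ref"]]
            day = datetime.fromtimestamp(v["when"], tz=timezone.utc).date().isoformat()
            buckets[day].append(v["when"] - j["when"])
    return {day: sum(d) / len(d) for day, d in buckets.items()}
```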


What I Changed (And What You Can Steal)

I didn't build a complex observability platform. I added three rules to my orchestrator based on what the decision logs revealed:

| Rule | Trigger | Action |
| --- | --- | --- |
| Circular delegation guard | D chain contains duplicate who | Break loop, escalate |
| Retry cap | Tool call fails > 3 times | Log T, continue with partial data |
| Termination storm alert | > 5 T with same reason in 1 min | Pause pipeline, notify |

The next week's run: 1.8 hours. $3.70. And I caught an API outage before it silently corrupted a report.


Why This Matters (And Why Most Agent Logs Are Useless)

Here's what I learned: there's a difference between logging actions and logging decisions.

| Action Log | Decision Log |
| --- | --- |
| "Called API X" | "Delegated to Analyst because confidence < 0.7" |
| "Task completed" | "Terminated with partial data due to tool timeout" |
| "Received response" | "Verified output—hash matches, coherence score 0.82" |

Action logs tell you what happened. Decision logs tell you why.

Without the "why," debugging multi-agent systems is just guesswork.


How to Get Started (Without Adopting a New Protocol)

You don't need to rebuild your entire stack. Start simple:

Level 1: Structured handoff logs

Add one JSON line every time an agent hands off work to another. Include from, to, reason, and a hash of the payload. That alone will catch delegation loops.
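A Level 1 logger really is this small. The field names below follow the suggestion above; adapt them to taste:

```python
import hashlib
import json
import time

def log_handoff(from_agent, to_agent, reason, payload, logfile="handoffs.ndjson"):
    """Append one JSON line per handoff—enough to catch delegation loops later.

    The payload is hashed, not stored, so the log stays small and greppable.
    """
    line = {
        "from": from_agent,
        "to": to_agent,
        "reason": reason,
        "payload_sha256": hashlib.sha256(payload.encode()).hexdigest()[:16],
        "ts": int(time.time()),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(line) + "\n")
    return line
```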

Level 2: Add decision verbs

Tag each log with what kind of decision it represents: initiated, delegated, terminated, verified. This makes it searchable and graphable.

Level 3: Chain them together

Use a ref field to link events. Now you have a trace of the entire decision chain, not just isolated events.
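Reconstructing a chain from linked events is a short walk backward through the refs. A sketch, assuming each event carries a unique nonce and an optional ref to its parent's nonce:

```python
def reconstruct_chain(events, nonce):
    """Follow ref links from one event back to its chain root.

    Returns the chain oldest-first, so you can read the decision
    trail top to bottom.
    """
    by_nonce = {e["nonce"]: e for e in events}
    chain = []
    cur = by_nonce.get(nonce)
    while cur is not None:
        chain.append(cur)
        cur = by_nonce.get(cur.get("ref"))  # None at the root ends the walk
    return list(reversed(chain))
```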

Level 4: Add signatures (if you need non-repudiation)

If you're building something where audit trails matter—compliance, finance, multi-party systems—you'll want cryptographic signatures. The format I used above is compatible with JEP (Judgment Event Protocol), which adds signing and anti-replay protection out of the box. But you can get 80% of the value with plain JSON and a ref field.


I Open-Sourced the Logger

The logger I used for this experiment is now open-source. It's 200 lines of Python, works with any agent framework, and writes to NDJSON so you can cat and grep like it's 1999.

👉 GitHub: agent-decision-logger

It includes:

  • The core logger with all four decision verbs
  • A Mermaid visualization script (see your agents' decision chains as flowcharts)
  • Analysis tools to detect delegation loops and termination storms
  • Complete examples and tests
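To give a flavor of the visualization step, here's a toy version of what a Mermaid export can look like (this is my sketch of the idea, not the repo's actual script): one edge per delegation, built from the same ref links.

```python
def events_to_mermaid(events):
    """Render D (Delegate) events as a Mermaid flowchart definition."""
    lines = ["flowchart LR"]
    by_nonce = {e["nonce"]: e for e in events}
    for e in events:
        # An edge from the parent event's agent to this event's agent
        if e["verb"] == "D" and e.get("ref") in by_nonce:
            parent = by_nonce[e["ref"]]
            lines.append(f'    {parent["who"]} -->|D| {e["who"]}')
    return "\n".join(lines)
```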

Your Turn

My agent wasn't broken. It was just making expensive decisions I couldn't see.

What's the weirdest thing you've found in your agent's logs? Or are you flying blind?

Drop a comment—I'd love to hear what you're seeing (or not seeing) in your own systems.

