Jarvis Specter

Your AI Agents Are Talking to Each Other. Here's How to Find Out What They're Saying.

Last week, someone on r/AI_Agents posted this:

"My company is spending $12k/month on AI agents and I have no idea what half of them are actually doing."

151 upvotes. 57 comments. Everyone nodding.

The post wasn't a sob story about bad vendors or broken models. It was something more uncomfortable: a confession that at some point, the agent stack became too big to understand. Agents were calling other agents. Costs were climbing. And the founder had lost the thread of what was actually producing value vs. what was just... talking.

I've been running 23 agents in production for over a year. I've lived this. Here's the audit framework that saved me — and the thing nobody warns you about before you get there.


The "Talking to Each Other" Problem

Most agent stack horror stories start the same way: you add one agent, it works great, you add another to help the first one, then a third to orchestrate the first two... and three months later you're staring at a Slack message saying your API bill doubled and you can't explain why.

The problem isn't that agents are bad. The problem is that inter-agent communication is invisible by default.

When Agent A calls Agent B to summarise something, which calls Agent C to fetch context, which loops back to Agent A with a clarifying question — you've just used 15,000 tokens to do something a single well-crafted prompt could have done in 800. And you'll never know, because none of the standard dashboards show you the graph — at best they show individual calls, never how they chain together.

Before I built out monitoring, I had exactly this running. A research agent, a summarisation agent, and a "quality check" agent that would reject summaries and send them back for rework. In theory: elegant. In practice: a loop that sometimes ran 7 cycles on a single document before producing output. At GPT-4 prices, that's not elegant — it's expensive.


Step 1: Map the Conversation Graph

The first thing to do is brutal and manual: draw every agent-to-agent communication path you have.

You're looking for:

  • Which agents call other agents (direct invocations, not just shared tools)
  • What triggers each call (event, scheduled, reactive)
  • Whether there's a termination condition or whether it's just "run until done"

This doesn't need to be fancy. A whiteboard works. What you're looking for are cycles — paths that can loop back on themselves. Any cycle without a hard limit is a potential runaway.

In practice, I've found that most teams have 2-4 inter-agent cycles they didn't know existed. They emerged organically as features were added. The cycle only becomes visible when the bill arrives.

Quick audit tool: Run your agent stack for 24 hours with verbose logging on. Search your logs for any agent ID that appears as both a caller and a callee. That's your list of suspects.
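That log search can be scripted in a few lines. A minimal sketch, assuming your verbose logs record each inter-agent call with `caller=` and `callee=` fields — adjust the regex to whatever your logging actually emits:

```python
import re

# Hypothetical log format: each inter-agent call is logged as
# "... caller=<agent_id> callee=<agent_id> ..."
CALL_RE = re.compile(r"caller=(\S+)\s+callee=(\S+)")

def find_suspects(log_lines):
    """Return agent IDs that appear as both a caller and a callee."""
    callers, callees = set(), set()
    for line in log_lines:
        m = CALL_RE.search(line)
        if m:
            callers.add(m.group(1))
            callees.add(m.group(2))
    return callers & callees

logs = [
    "2025-01-10T09:00 caller=research callee=summarise",
    "2025-01-10T09:01 caller=summarise callee=quality_check",
    "2025-01-10T09:02 caller=quality_check callee=summarise",  # a cycle
]
print(sorted(find_suspects(logs)))  # → ['quality_check', 'summarise']
```

Anything this prints is worth a closer look: it can legitimately sit mid-pipeline, or it can be one half of a loop.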


Step 2: Classify Your Agents by Output Type

Here's a distinction that changed how I think about cost attribution: the difference between terminal agents and intermediate agents.

  • Terminal agents produce something a human uses: a report, a drafted email, a published post, a decision.
  • Intermediate agents produce something another agent uses: a summary, a classification, a data fetch result.

Intermediate agents are invisible on your cost dashboard because they don't produce user-visible output. But they can consume as much (or more) compute as terminal agents.

Run this exercise: for every intermediate agent in your stack, ask "what's the value per invocation?" Not the cost — the value. If you can't answer that within 30 seconds, that agent either needs better observability or it needs to be eliminated.

In my stack, I had a "context enrichment" agent that ran on every inbound message to add background information. Sounds useful. In practice, 80% of messages didn't need enrichment — they were simple queries that didn't benefit from the extra context. The agent was adding cost and latency with no measurable improvement in output quality. It's gone now.
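The 30-second test can be made mechanical. A sketch — agent names and value metrics here are illustrative, not from any particular framework:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Agent:
    name: str
    terminal: bool               # produces human-facing output?
    value_metric: Optional[str]  # how value per invocation is measured, if at all

# Hypothetical stack for illustration
stack = [
    Agent("email_triage", terminal=True, value_metric="emails handled"),
    Agent("context_enrichment", terminal=False, value_metric=None),
    Agent("research_fetcher", terminal=False, value_metric="summaries used downstream"),
]

def elimination_candidates(agents):
    """Intermediate agents with no defined value metric fail the 30-second test."""
    return [a.name for a in agents if not a.terminal and a.value_metric is None]

print(elimination_candidates(stack))  # → ['context_enrichment']
```

Keeping this table in code (or even a spreadsheet) forces the question to be answered once per agent, not re-litigated every time the bill spikes.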


Step 3: Instrument the Costs You Actually Care About

Standard LLM cost dashboards show you spend by model. That's not what you need.

What you need is cost by task type, not cost by model.

This requires tagging. Every agent invocation should carry metadata: which task type triggered it, which agent chain it belongs to, and whether it produced terminal output. Then you aggregate by task type, not by agent.
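A sketch of what that tagging and aggregation can look like — the field names are illustrative, not any vendor's schema:

```python
from collections import defaultdict

# Each invocation record carries task-level metadata (hypothetical fields)
invocations = [
    {"agent": "classifier", "task_type": "email_triage", "chain_id": "c1",
     "terminal": True, "cost_usd": 0.002},
    {"agent": "fetcher", "task_type": "research", "chain_id": "c2",
     "terminal": False, "cost_usd": 0.010},
    {"agent": "summariser", "task_type": "research", "chain_id": "c2",
     "terminal": False, "cost_usd": 0.015},
]

def cost_by_task_type(records):
    """Aggregate spend by the task that triggered it, not by agent or model."""
    totals = defaultdict(float)
    for r in records:
        totals[r["task_type"]] = round(totals[r["task_type"]] + r["cost_usd"], 6)
    return dict(totals)

print(cost_by_task_type(invocations))
# → {'email_triage': 0.002, 'research': 0.025}
```

Note that the two research-pipeline agents roll up into one number — which is the point. You want to price the task, not the workers.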

When I did this audit on my own stack, I found:

  • Email triage: $0.40/day (terminal, high value, keep)
  • Content research pipeline: $2.10/day (mostly intermediate agents doing redundant work, needs pruning)
  • Scheduled monitoring agents: $3.80/day (most firing with nothing to report, needed conditional logic)

The monitoring agents were the killer. They ran every 30 minutes regardless of whether there was anything to monitor. Adding a simple "if nothing changed since last check, exit early" cut that $3.80/day to $0.60/day overnight.
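That early-exit guard is only a few lines. A sketch, assuming the monitor can cheaply observe its watched state before invoking any model — the fingerprint store here is an in-memory dict, which you'd persist for real scheduled runs:

```python
import hashlib
import json

_last_fingerprint = {}  # per-monitor state; use a durable store in production

def should_run(monitor_id, observed_state):
    """Skip the LLM call entirely when the watched state hasn't changed."""
    fp = hashlib.sha256(
        json.dumps(observed_state, sort_keys=True).encode()
    ).hexdigest()
    if _last_fingerprint.get(monitor_id) == fp:
        return False  # nothing changed since last check, exit early
    _last_fingerprint[monitor_id] = fp
    return True

state = {"open_tickets": 4, "errors": 0}
print(should_run("helpdesk", state))  # → True (first check)
print(should_run("helpdesk", state))  # → False (unchanged, skip the model call)
```

The cheap observation (an API poll, a row count, a file mtime) costs fractions of a cent; the model call it prevents is the expensive part.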


Step 4: Enforce Hard Limits Before You Trust Any Agent

This is the one everyone skips.

Every agent-to-agent communication chain needs:

  1. A maximum depth — how many agents deep can a single request go?
  2. A timeout — how long before we kill it and return an error?
  3. A retry limit — how many times can an agent send work back for revision?

Without these, you don't have a system — you have a conversation that can run indefinitely. Models are surprisingly creative at finding reasons to keep iterating. "Quality checks" especially. Any agent with a "review and improve" step is a candidate for infinite loops unless you bound it explicitly.

My rule: max depth of 3, max retries of 2, timeout at 90 seconds. Those aren't magic numbers — tune them to your stack. But pick numbers and enforce them. The absence of limits is where $12k months come from.
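Here's a minimal sketch of enforcing all three limits per request, using my numbers — swap in your own:

```python
import time

MAX_DEPTH = 3    # how many agents deep a single request can go
MAX_RETRIES = 2  # how many times work can be sent back for revision
TIMEOUT_S = 90   # wall-clock budget for the whole chain

class ChainBudget:
    """Per-request budget, checked before every agent-to-agent hop."""
    def __init__(self):
        self.start = time.monotonic()
        self.retries = 0

    def check_hop(self, depth):
        if depth > MAX_DEPTH:
            raise RuntimeError(f"depth {depth} exceeds max of {MAX_DEPTH}")
        if time.monotonic() - self.start > TIMEOUT_S:
            raise TimeoutError("chain exceeded its 90s budget")

    def record_retry(self):
        self.retries += 1
        if self.retries > MAX_RETRIES:
            raise RuntimeError("revision loop hit retry limit, returning best effort")

budget = ChainBudget()
budget.check_hop(depth=2)   # within limits, proceeds silently
try:
    budget.check_hop(depth=4)
except RuntimeError as e:
    print(e)  # → depth 4 exceeds max of 3
```

The key design choice: the budget lives with the request, not with any single agent, so a loop between two "well-behaved" agents still gets killed.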


Step 5: Measure Output, Not Activity

The last piece of the audit: stop measuring how busy your agents are and start measuring what they produce.

Activity metrics are seductive. "My agents made 4,200 calls this week" sounds productive. But calls aren't value.

For every agent (or agent chain), define a unit of value:

  • Email agent: emails handled without human intervention
  • Research agent: useful summaries produced (not total summaries — useful ones)
  • Content agent: posts published (not drafted — published)

Then calculate cost-per-unit-of-value. If your email agent handles 400 emails per week at $0.40/day, that's ~$2.80/week — about 0.7 cents per email handled. That's outstanding ROI. If your research pipeline costs $14.70/week and produces 3 useful summaries, that's $4.90 per summary. Worth it or not? Only you can answer that — but now you can ask the question.
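The arithmetic is simple enough to script, which keeps the question honest week over week. The numbers below are the ones from this section:

```python
def cost_per_unit(cost_per_day, units_per_week):
    """Weekly cost divided by units of value produced that week."""
    weekly_cost = cost_per_day * 7
    return weekly_cost / units_per_week

# Email agent: $0.40/day, 400 emails handled per week
print(f"{cost_per_unit(0.40, 400) * 100:.1f} cents per email")  # → 0.7 cents per email

# Research pipeline: $2.10/day, 3 useful summaries per week
print(f"${cost_per_unit(2.10, 3):.2f} per useful summary")      # → $4.90 per useful summary
```

Run it against last week's tagged invocation data (from Step 3) and the "worth it or not?" conversation becomes concrete.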

The goal isn't to minimise spend. It's to know what you're spending and why.


The Real Problem Is Observability

The $12k/month post resonated because it named something real: most agent stacks are built without observability as a first-class concern. Observability is an afterthought, bolted on after costs spike.

We built our own internal tooling — Mission Control OS — specifically because we kept hitting this wall. Every agent in our stack reports into a central runtime: what it did, how long it took, what it cost, and what it produced. The graph of agent interactions is visible, not inferred.

It took us months of production experience to understand what to instrument. The audit steps above are the distillation of those months.

The short version: if you can't draw your agent interaction graph right now, without looking at code — you're flying blind. Not because your stack is broken, but because you built it without a cockpit.

Fix the cockpit first. The turbulence gets a lot less scary when you can see what's happening.


If you're building multi-agent systems, check out Mission Control OS — we've been running it in production for a year: https://jarveyspecter.gumroad.com/l/pmpfz
