DEV Community: OpsVeritas

Delegation Masking: Why Your LangChain Callbacks Lie About Sub-Agent Failures

Babar Hayat — Tue, 28 Jul 2026 08:38:58 +0000

You delegate a task from Agent A to Agent B in LangChain. Agent B fails. Agent A's callback chain fires 'success' anyway.

This is the observability blind spot most builders miss in agentic workflows: delegation masking. A sub-agent fails silently, but the parent agent's callback layer never knows because it only watches the delegation call itself, not what the delegated agent actually did.

Let's walk the mechanism.

The Delegation Pattern in LangChain

When you wire up agent-to-agent delegation in LangChain, you're typically using the tool decorator to wrap a sub-agent invocation:

@tool
def delegate_to_classification_agent(task: str) -> str:
    """Delegate classification to a specialized sub-agent."""
    result = classification_agent.invoke({"input": task})
    return result.get("output", "")

The parent agent treats this as just another tool. It calls it, gets a result, moves on. The parent agent's callback chain (the layer that logs success/failure, fires alerts, measures latency) only sees the function return value, not the internal state of the sub-agent.

Where the Failure Hides

Here's what can happen:

Agent B (sub-agent) fails to produce valid output. Its internal chain breaks, maybe a tool call failed, or output parsing broke, or the LLM went silent. Agent B's callback chain logs the failure.
But the delegation function still returns something. Maybe it returns an empty string, a cached fallback, or a generic error message. It doesn't raise an exception, it just returns.
Agent A's callback layer sees HTTP 200. The delegation tool returned something, so the callback fires on_tool_end with status: "success". The parent agent logs success, moves on.
The actual failure is buried two layers deep. Agent A's monitoring sees "delegation succeeded." Only if someone digs into Agent B's logs does the failure surface.

This is a callback visibility boundary. The parent agent's instrumentation layer is one level too high to catch delegation failures.

The Signal That Catches It

Standard token-counting observability misses this because both agents might report partial token usage (Agent B started, burned some tokens, then failed). A success callback fired, so metrics look nominal.

What actually catches delegation failures:

1. Output validation at the delegation boundary

Track what the delegation function actually returned:

@tool
def delegate_to_agent(task: str) -> str:
    """Delegate with observability."""
    result = sub_agent.invoke({"input": task})
    output = result.get("output", "")

    # Signal: validate the output exists and is non-empty
    if not output or output.strip() == "":
        # This is a silent failure, the sub-agent ran but produced nothing
        log_event("delegation_produced_empty_output", {
            "delegated_agent": "classification_agent",
            "input": task,
            "sub_agent_status": result.get("status"),
            "token_count": result.get("usage", {}).get("output_tokens", 0)
        })

    return output

The signal is: a delegation that returned empty or unchanged input. The parent agent's callback sees "success," but observability knows something went wrong.

2. Cross-agent execution correlation

When Agent A calls Agent B, log a correlation ID that links both agents' execution traces:

import uuid

correlation_id = str(uuid.uuid4())

# In Agent A's tool:
result = sub_agent.invoke(
    {"input": task, "correlation_id": correlation_id},
    metadata={"correlation_id": correlation_id}
)

# In sub-agent's callback handler:
def on_tool_end(self, output: str, **kwargs):
    corr_id = kwargs.get("metadata", {}).get("correlation_id")
    if corr_id:
        log_event("agent_execution", {
            "correlation_id": corr_id,
            "agent": "sub_agent",
            "output_tokens": ...,
            "status": "success" or "failed"
        })

Now you can query: "What sub-agent executions have a correlation_id but show empty output or error status?" That's where delegation failures hide.

3. Aggregate success rate per agent pair

Track success rates at the delegation edge, not just per-agent:

delegation_success_rate = (
    executions where parent_agent=A and delegated_agent=B and output_validation_passed
) / (total executions where parent_agent=A and delegated_agent=B)

If Agent A to Agent B delegation shows 95% success in parent logs but only 70% pass output validation, you've found the callback masking.

Why Frameworks Don't Catch This

LangChain's callback layer is designed to instrument the calling agent's perspective, not the called agent's internal state. That's by design, clean separation of concerns. But it means delegation failures are invisible until they propagate (or don't).

Most frameworks have the same boundary. CrewAI's task delegation, AutoGen's sub-agent calls, they all fire success callbacks when the call itself succeeds, regardless of what the called agent actually did.

The Observability Fix

You need one layer deeper:

Instrument sub-agents independently. Log their execution status, output validity, token usage. Don't rely on the parent agent's callback to know what happened.
Validate delegation outputs. Don't trust that a delegation function returning a value means the sub-agent actually succeeded. Check the output.
Correlate across agents. Link parent and child agent executions so failures propagate upward in your observability dashboard, not just downward in logs.

The callback chain is essential, but it's not enough. Delegation visibility requires you to see both sides of the boundary.

The takeaway: When agents delegate to other agents, callback success is not the same as execution success. Standard monitoring stays silent because the parent agent's callbacks only see the delegation call, not the delegated agent's actual work. Catch it by validating outputs, correlating executions across agents, and measuring success rates at the delegation edge.

Instrumentation Patterns for AI Agents: SDK vs Webhook

Babar Hayat — Mon, 27 Jul 2026 05:30:28 +0000

When you instrument a distributed system — a microservice mesh, a backend job queue, a real-time event pipeline — you don't ask "should we?" You ask "how?" And you know the playbook: wrap your client, push telemetry, choose your transport, decide on sampling.

AI agents need the same discipline. But right now, most builders either skip instrumentation entirely or bolt it on as an afterthought. The gap between "my agent runs" and "I know what my agent actually did" is where silent failures hide, cost spikes live invisible, and production incidents start.

There are two proven patterns for wiring observability into AI agents: SDK-based instrumentation and webhook-based telemetry. Neither is universally better, each trades off deployment simplicity, latency impact, privacy scope, and operational control. Understanding those tradeoffs matters: it determines whether you catch silent failures before your customers do.

Pattern 1: SDK Instrumentation

With the SDK pattern, you install a lightweight library into your agent's runtime and wrap your model client, the OpenAI, Anthropic, or Gemini instance your agent actually calls.

# Install: pip install opsveritas
from opsveritas import init, wrap
import openai

init(secret="YOUR_AGENT_SECRET")
client = openai.Client(api_key="...")
wrapped_client = wrap(client)  # That's it

# Now your agent uses wrapped_client instead of client
response = wrapped_client.chat.completions.create(...)

The SDK intercepts the call before it leaves your process, reads the request metadata and response (tokens, latency, cost, parsed output), and ships that telemetry asynchronously. Your agent's latency is unaffected; the SDK's overhead is a few milliseconds of serialization.

The tradeoffs:

Low latency impact. Telemetry is pushed in the background, so your agent's response time doesn't change.
In-process visibility. The SDK sees the raw request and response before they leave your Python or Node process, capturing token counts, model name, and optionally a summary of the output without re-parsing.
Framework coverage. SDKs can auto-instrument specific client libraries (OpenAI, Anthropic, Gemini) and frameworks (LangChain callbacks, CrewAI integration). Each integration is narrow but deep.
Operational cost. You manage telemetry transport, meaning SDK retries, buffering, batching. If your network is flaky, telemetry may queue or drop.
Privacy scope. The SDK runs in your environment; you control whether to strip output text, run in metadata-only mode, or send full details.
Framework coupling. You depend on SDK updates to support new models or client libraries. An obscure or internal LLM client won't be auto-instrumented.

Pattern 2: Webhook Instrumentation

With the webhook pattern, you don't install a library. Instead, you POST telemetry directly to an observability service from your agent code.

import requests
from datetime import datetime

# After your agent runs
response = agent.run(user_input)

payload = {
    "agent_name": "document-processor",
    "status": "success",
    "executed_at": datetime.utcnow().isoformat(),
    "input_tokens": 500,
    "output_tokens": 150,
    "model": "gpt-4o",
    "cost_usd": 0.0075,
    "duration_ms": 2400,
}

requests.post(
    "https://ai-agents-control-tower.onrender.com/webhooks/agent-execution",
    json=payload,
    headers={"x-agents-key": "YOUR_ORG_SECRET"}
)

You decide what to capture and POST it yourself. There's no magic, just HTTP.

The tradeoffs:

Framework-agnostic. Works with any agent framework, any LLM client, even custom scripts. You're not locked into SDK coverage.
Operational control. You own the payload shape, so you can capture custom fields (user ID, feature flags, request context) that matter to your business.
Network latency. The webhook is an HTTP request. If your observability service is slow or the network is congested, it adds latency to your agent's response time unless you fire-and-forget with an async task.
Manual instrumentation. You have to write the code to collect and POST telemetry. It's not automatically captured the way the SDK auto-patches a client.
Privacy-first. You decide exactly what data gets shipped. No SDK auto-capturing output text or summarizing responses unless you code it.
Operational resilience. If the observability service is down, your webhook requests will fail. You need retry logic and queueing to avoid blocking your agent.

How they differ in practice

Capture scope: the SDK automatically captures tokens, latency, model, and output (configurable). With webhooks, you decide, and minimal setup means only the fields you code.

Latency cost: SDK overhead is negligible, async telemetry serialization. Webhooks add 50 to 500ms per request unless you async-queue them.

Time to first signal: with the SDK it's immediate, since telemetry is already in your code. With webhooks you add instrumentation per agent or per framework, which takes more planning.

Handling new models: SDK updates add support and you upgrade. With webhooks you handle it yourself, usually just adding the cost calculation.

Privacy: the SDK is configurable, with a metadata-only mode that strips all output content. Webhooks send whatever you choose to POST.

When to use each

Use the SDK if you have a small number of well-known model clients (OpenAI, Anthropic, Gemini), want observability with minimal code changes, your agent is latency-sensitive and can't afford webhook round-trips, or you're using a supported framework like LangChain or CrewAI and want callbacks wired automatically.

Use webhooks if you have a heterogeneous stack (internal LLM API, third-party models, multiple clients), need custom telemetry fields (user context, feature flags, request metadata), want to avoid SDK dependencies and keep your deployment simple, or you're comfortable managing retry logic and async queueing.

Use both if you have a hybrid setup: SDKs for critical paths like real-time APIs, webhooks for background jobs and batch processing.

The implementation reality

In practice, the pattern you choose shapes your observability architecture for months. The SDK path is faster to ship but locks you into SDK coverage. The webhook path requires more upfront design but gives you more flexibility.

Most production AI systems end up using both: the SDK for OpenAI/Anthropic agents in hot paths where latency matters, webhooks for heterogeneous or custom setups. The tradeoff isn't binary, it's contextual.

The core insight is that instrumentation isn't optional. Whether you choose SDK or webhooks, the choice forces you to think about what you need to observe, and that discipline is what catches silent failures before production users do.

Pick the pattern that matches your architecture. Wire it in before you ship to production. And don't wait until a cost spike or a failed task to realize you have no visibility into what actually ran.

Token Anomaly Detection: The Algorithm That Stops Runaway Loops Before the Bill Arrives

Babar Hayat — Sat, 25 Jul 2026 05:37:25 +0000

The Problem: Why Token Spikes Hide in Plain Sight

Most AI teams monitor agent executions at the binary level: succeeded or failed. The agent gets 200 OK, so it worked. But there's a whole category of failure that standard monitoring misses: the execution that succeeds but consumes 10× its normal token budget in the process.

This happens most often when an agent enters a loop. It calls a tool, checks the result, calls the tool again to refine, checks again—each iteration adds to the token count. Twenty iterations later, an execution that should have cost $0.02 has cost $2.00. And it still returns 200.

By the time you see the spike in your monthly invoice, you've already lost thousands.

The fix isn't to ban loops—sometimes agents need iteration to get the right answer. The fix is to detect meaningful token anomalies before they become expensive—in the first execution, before the cascade.

The Math: Percentile Baselines and Z-Scores

A solid anomaly detector needs two parts: a baseline that captures "normal," and a method to detect when an execution drifts far from it.

Part 1: Build a Percentile Baseline

Don't use the mean. A running average is too easily pulled upward by a few expensive runs, and it doesn't tell you about the shape of your typical behavior.

Instead, track percentiles of token usage over a rolling window (typically 30 days or the last 100 runs, whichever is larger):

p50 = median token count
p95 = 95th percentile token count
p99 = 99th percentile token count

Why percentiles? Because they're robust to outliers. If your agent normally uses 500 tokens but one run uses 50,000 (an anomaly you're trying to catch), the median and p95 stay grounded in typical behavior. The mean would shift.

Part 2: Flag a Token Spike When Z > 2.5

Once you have your baseline (especially p95), check each new execution:

z_score = (tokens_this_run - p95) / standard_deviation

If z_score > 2.5, flag it. (2.5 is a threshold; adjust based on false-alarm tolerance. Higher = fewer alarms but risk missing real spikes; lower = more false alarms but catch edge cases earlier.)

Why 2.5 and not 2? At 2.0 standard deviations, you'd expect ~2.3% of normal runs to be flagged (false alarms). At 2.5, it's ~0.6%. For cost governance, false alarms are cheap (annoying, but you're just investigating); missing a real spike costs money.

The Code Sketch

import numpy as np
from collections import deque

class TokenAnomalyDetector:
    def __init__(self, window_size=100, z_threshold=2.5):
        self.window = deque(maxlen=window_size)
        self.z_threshold = z_threshold

    def record(self, token_count):
        self.window.append(token_count)

    def is_anomalous(self, token_count):
        if len(self.window) < 10:  # Need enough data
            return False

        tokens_array = np.array(self.window)
        p95 = np.percentile(tokens_array, 95)
        std = np.std(tokens_array)

        if std == 0:  # All runs identical (rare, but handle it)
            return token_count > p95 * 1.5

        z_score = (token_count - p95) / std

        is_anomaly = z_score > self.z_threshold

        # Record the run for future baselines
        self.record(token_count)

        return is_anomaly

Record every execution's token count in the window. When a new execution arrives, compute p95 and standard deviation from the window, calculate z_score, and flag if it breaches the threshold. Then add the new execution to the window.

Why Correlate With Latency

Token anomalies often come paired with latency spikes. If an execution took 10× its normal tokens and took 5× longer to complete, it's almost certainly looping or stuck in retry logic.

Add a second signal:

latency_z = (latency_this_run - p95_latency) / std_latency
anomaly = (token_z > 2.5) and (latency_z > 2.0)  # Both must spike

Requiring both signals to spike simultaneously is more conservative—fewer false alarms—but you catch the real culprits: the agent that both burned tokens and took forever. That's a loop.

If you want to catch even fast anomalies (e.g., an agent that made 50 tool calls in rapid succession, token-heavy but latency-OK), track tool-call count as a third signal:

tool_calls_z = (tool_calls_this_run - p95_tool_calls) / std_tool_calls
anomaly = (token_z > 2.5) or (tool_calls_z > 2.5)  # Either signal fires

Now you catch both slow loops (token + latency spike) and fast loops (token + tool-call count spike).

Putting It Together: The Alert

When an anomaly is detected, surface it immediately:

Which agent: name and framework.
What spiked: tokens (from 500 to 12,000), latency (from 2s to 15s), tool calls (from 3 to 47).
The baseline: "normal p95 is 600 tokens; this run was 20× higher."
The cost: if tokens spike 10×, cost spiked 10×. Show the dollar amount to make it real.
The action: link to the run's full execution trace so you can see which tool call started the loop.

In the AI Agents Control Tower, this surfaces as a Token Anomaly alert—one of the 12 alert types you can subscribe to. It fires only when a run's token usage jumps far above the agent's baseline, so you catch cost problems before they compound.

Why This Matters

The difference between detecting a runaway loop in its first execution (cost: $0.50, alert received in seconds) and detecting it in your monthly invoice (cost: $5,000, damage already done) is an anomaly detector.

The math is simple. The impact is large.

Further reading: If you're building AI agents, set a token baseline from day one. Use percentiles, not averages. And correlate token spikes with latency and tool-call count—multiple signals reduce false alarms and catch real problems faster.

One Agent Times Out. Three More Agents Don't Notice.

Babar Hayat — Fri, 24 Jul 2026 05:42:26 +0000

One agent times out. Three more agents don't notice.

Here's how a single point of failure becomes invisible in a distributed multi-agent system.

Agent A is a document processor. It's supposed to fetch a user's uploaded file, extract structured data, and pass it downstream. Latency SLA: 5 seconds.

On Tuesday at 2:47 PM, the file server hiccups. Agent A's request hangs for 7 seconds, then times out. Agent A returns an error to its caller—Agent B, the orchestrator.

Agent B sees the timeout. Rather than halt, it has a fallback: retry using the last successful result from Agent C, a caching layer. Agent C was pinged 40 minutes earlier; it has stale data from the user's previous upload. Agent B doesn't know it's stale. It just knows it's available. Agent B retrieves it and passes it to Agent D.

Agent D is the final step: validation and storage. It receives the data from Agent B, runs a schema check (passes—the data is well-formed, just old), and writes to the database. Agent D returns HTTP 200.

From the outside: the system succeeded. Four agents ran. All returned success. A user's file was "processed."

Except the data in the database is from last week.

The user didn't notice until the next day, when they queried their own data and saw timestamps that didn't match their actions.

Why this matters at scale

In single-agent systems, timeouts and retries are loud. One agent fails, the caller sees it.

In distributed multi-agent cascades, failures become options. Agent B didn't panic at the timeout—it had a graceful path. Agent C had cached data. Agent D validated successfully. Every agent did its job. The system appeared healthy to standard monitoring because every individual agent reported success.

The silent failure lived in the handoff: Agent B made a reasonable choice (use cached data) without visibility into whether that cached data was still valid. Agent D validated syntax, not currency. Agent A's timeout was absorbed, not propagated.

Standard monitoring sees four successful executions. It doesn't see that Agent B's retry decision was made on stale information, or that Agent D validated the wrong thing.

The pattern

Distributed multi-agent systems amplify a silent-failure risk that single agents don't have: one agent's failure mode becomes another agent's fallback option, and nobody asks whether the fallback is correct for this execution.

When you move from monitoring individual agents to monitoring agent ecosystems, you need visibility at the handoff points. Not just "did agent B run?" but "did agent B have fresh data when it made its retry decision?" and "did agent D validate the right version?"

Without that, you're monitoring the parts but not the system.

If you're running multi-agent systems in production, worth asking: can you trace what each agent actually received from the one before it? If not, you're flying partially blind.

Multi-Agent Pre-Flight Checklist: Testing Cascade Detection Before Production

Babar Hayat — Thu, 23 Jul 2026 05:22:54 +0000

Real stories on why automations and AI agents report "success" while quietly doing nothing.

Before you deploy a multi-agent system to production, one hard question: if a sub-agent fails silently, does your orchestrator know?

Most don't. And that's exactly where cascading failures hide.

When Agent A delegates work to Agent B, and Agent B returns HTTP 200 but produces zero output, the orchestrator often assumes success -- until your customer notices the work never happened. By then, the failure has already masked itself up the chain.

Here's a three-part checklist to catch this before production.

1. Chaos-Inject Silent Failures into Sub-Agents

In your test environment, deliberately inject a silent failure into one sub-agent: have it return success with empty output (zero output tokens, blank response). Then run your orchestrator end-to-end.

Does your orchestrator detect the empty output from the sub-agent? Does it flag it as a problem, or does it treat 200 as "success and move on"? If the latter, you have a cascade risk.

Test this for each sub-agent independently. A multi-agent system is only as reliable as its weakest detection path.

2. Validate Fallback Chains Don't Mask the Real Failure

Fallbacks are good -- but only if they surface the underlying failure, not hide it.

Scenario: Agent A calls Agent B, which fails silently. Agent A's fallback kicks in and produces a result. Now the orchestrator sees a successful end-to-end run. But Agent B's failure is invisible -- baked into the fallback response without a flag.

Before production, run a test where:

Sub-agent fails silently.
Fallback executes and succeeds.
Check: does the final result carry metadata indicating "fallback was triggered"? Can you distinguish between "worked perfectly" and "fell back after a silent failure"?

If you can't, you're masking failures you should know about.

3. Cross-Agent Callback Correlation

In a multi-agent system, each agent fires callbacks (on execution, tool call, output). If these callbacks aren't correlated across agents, you lose visibility into the chain.

Before production:

Instrument each agent with callback handlers that include a trace ID (a unique identifier that flows from orchestrator to sub-agent to callback).
Verify that when Agent A calls Agent B, the trace ID is propagated -- so you can later correlate all callbacks from Agent B back to Agent A's execution.
Without correlation, a silent failure in Agent B looks like an orphaned callback from nowhere.

Why This Matters

Cascading failures in multi-agent systems are silent by design: each layer reports success, so the failure hides until it bubbles up as a customer complaint. Testing for these patterns now -- before production -- costs hours. Missing them costs dollars and reputation.

Run these three tests on your system. If any fail, you have observability work to do before shipping.

We build OpsVeritas and AI Agents Control Tower -- monitoring layers that catch silent failures like these across your automations and AI agents before your customers do.

Your CrewAI Agent Delegates a Task. It Fails Silently. The Caller Never Knows.

Babar Hayat — Wed, 22 Jul 2026 06:37:51 +0000

Agent A delegates a task to Agent B. Agent B fails silently. Agent A never sees it.

The Pattern

CrewAI's delegated_agent pattern is powerful: one agent can offload work to another, each with its own tools and prompts. In theory, clean separation of concerns. In practice, there's a gap where failures disappear.

Here's what happens:

Agent A (the caller) creates a task and delegates it to Agent B (the delegated agent).
Agent B executes, calls its tools, and returns a task output.
Agent A receives that output and treats it as ground truth — it continues execution based on what Agent B claimed to return.

The problem: if Agent B's tools fail partially — say, one tool returns an empty response, or a tool succeeds with a malformed result — Agent B may still construct a task output that looks valid to Agent A. The output serializes cleanly. No exception is raised. But the actual work didn't happen.

Why This Matters

Inter-agent communication in CrewAI happens through task outputs, which are strings or structured objects returned by the delegated agent. These outputs are not automatically validated against the original task's intent.

Let's walk through a concrete case:

Scenario: Agent A is a "report coordinator." It delegates a task: "Fetch the latest Q3 sales data from the database and return it as a JSON array."

Agent B (the "data fetcher") has a tool called fetch_sales_data().

Here's where the gap opens:

fetch_sales_data() calls the API and gets a network timeout.
The tool catches the exception and returns a default value — an empty JSON array [].
Agent B's LLM sees [] and, because it's valid JSON, decides "the query returned no results" and returns a task output: "Sales data retrieved successfully. Result: []".
Agent A receives this output. It's a string. It parses cleanly. Agent A continues its logic, possibly iterating over the empty array or passing it downstream.
Six hours later, when you check the report, you realize the data was never pulled — the entire downstream analysis is built on nothing.

No error logs. No exception trace. The delegated agent's output looked complete.

The Mechanical Root

The root is that delegated agents don't automatically validate their own outputs against the task definition. CrewAI doesn't (by default) assert: "Did you actually complete what was asked?" It assumes that if the LLM constructed a response and the tool calls happened, the task is done.

There are three specific points where this breaks:

1. Tool-Level Fallbacks Are Silent

If a delegated agent's tool has a fallback (like return [] if error else result), the agent's LLM can't distinguish between "the query succeeded and returned nothing" and "the query failed but I'm hiding it." The output looks the same.

2. No Built-In Output Validation

When the delegated agent finishes, its task output is a string or dict. CrewAI doesn't validate it against the original task intent by default. There's no assertion like "the task asked for a list of 10 items; does the output contain 10 items?"

If you want that validation, you build it yourself — it's not in the framework.

3. Calling Agent's Assumption of Completeness

Agent A, receiving the delegated output, treats it as ground truth. It doesn't re-check. It doesn't call the original tools itself. It trusts the delegation was complete.

How This Hides

Standard monitoring of CrewAI agents often looks at:

Did the workflow complete (end-to-end status)?
How many tool calls happened?
What was the final output?

It doesn't look at:

Did each delegated agent's output match the intent of its task?
Did tool errors get swallowed by fallbacks?
Is the calling agent's reasoning grounded in real data or in an agent's polite fiction?

So you see: workflow status ✓, tool calls ✓, output returned ✓. All green. But the intermediate agent's output was empty or malformed, and it cascaded downstream.

What Builders Usually Miss

Most teams find this when they add monitoring after an incident. They instrument the delegated agent's tool calls and discover:

The tool returned an error 50% of the time, but the delegated agent returned a "success" output anyway.

Or:

The output structure was inconsistent — sometimes it's an array, sometimes it's a string. The calling agent's code broke on one variant, silently fell back to a default, and nobody noticed for two days.

What You Can Do

If you're building with CrewAI's delegation pattern:

Explicitly validate outputs in the task definition. Set clear acceptance criteria in the task prompt: "Return ONLY a JSON array of exactly these fields…" not "Fetch the data." The delegated agent's LLM will be more careful.
Log intermediate outputs. Capture what each delegated agent returns and store it (not just the final workflow output). If something goes wrong downstream, you can trace which agent's output was the culprit.
Add fallback-aware monitoring. When you instrument tool calls, track which ones failed and had fallbacks applied. That distinction matters — it's a signal that the delegated agent's output might be reconstructed, not real.
Test the delegation path in isolation. Before shipping, run the delegated agent standalone with known bad inputs (network failures, timeouts, malformed API responses) and verify the output still clearly signals "incomplete" — not just returns an empty or default value.
Observe token flow. A delegated agent that fails silently often shows a tell: unusually low token counts on the calling agent (because it had less data to reason about). If your monitoring platform tracks tokens per agent, an anomalous dip can surface these incidents early.

The Bigger Picture

This isn't a CrewAI bug — it's a structural pattern in any multi-agent system: when agents communicate through outputs rather than shared state or explicit errors, gaps emerge.

The fix isn't to avoid delegation. It's to treat delegated outputs as potentially incomplete and validate aggressively rather than assume they're ground truth just because they parse cleanly.

Your monitoring should answer: "Did the delegated agent actually do what was asked?" not just "Did the delegated agent return an output?"

That distinction is the difference between a working system and one that fails silently for hours before anyone notices.

If you're running CrewAI agents in production, what validation are you doing on delegated outputs? This is a pattern worth stress-testing before it surprises you.

Your Agent's Bill Jumped 40%. It Never Errored Once.

Babar Hayat — Tue, 21 Jul 2026 07:21:24 +0000

You ship an AI agent to production. It runs smoothly for a week. Then one morning your LLM bill is 40% higher than expected—and you have no idea why. By the time you dig into logs, the damage is done.

The problem isn't that your agent broke loudly. It's that cost spikes are silent. Your agent succeeded. It returned tokens. It burned budget. And without a baseline to compare against, you won't see the spike until the bill arrives.

Why Baselines Matter (and Why Most Setups Skip Them)

Every LLM agent has a natural token consumption pattern. For a given input, it tends to use roughly the same number of tokens each time—within a predictable range. That range is your baseline.

When an agent deviates sharply from its baseline, something changed:

A retry loop is running (hitting rate limits, trying again, trying again).
The prompt got longer accidentally.
A tool call is looping—each iteration calling the model again.
The model switched to a more expensive variant.

The catch: a 30% spike in tokens is real and expensive, but it's invisible if you're only looking at raw counts. You need a baseline to know it's an anomaly at all.

How to Build a Token Baseline (Three Metrics)

The cleanest approach uses percentiles. Here's why percentiles work better than simple averages:

Median (p50): Half your runs use fewer tokens than this; half use more. It's stable and unaffected by outliers. If your agent usually takes 150 tokens per run and the p50 is 160, that's your normal.

95th percentile (p95): 95% of your runs fall below this threshold. It captures the "busy day" for your agent—the legitimate high-end of normal. If your p95 is 400 tokens, you expect some runs to hit that; they're not anomalies.

99th percentile (p99): The rare, expensive run. Useful for budgeting headroom, but too loose for early anomaly detection.

Why percentiles over averages? One runaway loop pushes an average up by 30%. The p95 barely moves—it's already accounting for variability. Averages lie when there are outliers. Percentiles don't.

The Signal-to-Noise Problem: When Do You Alert?

Here's where most baselines fail: they're too trigger-happy. Alert on every spike above p95, and you'll get alerts when the agent just had a legitimately complex input. Ignore real spikes, and you miss the runaway loop that costs $500.

The trick is meaningful deviation: a spike that's both large and sustained.

Single-run spike above p99? Probably legitimate—the agent got a complex query, used more tokens, moved on. No alert needed yet.

Three runs in a row above p95, with at least one above p99? Now we're talking. Multiple high-cost runs suggest something systematic changed, not a one-off.

Weekly spend 50% above last week's baseline? That's a signal worth investigating.

The math is straightforward: if your agent normally costs $0.02 per run (p50 at 100 tokens * $0.00015/token for GPT-4o) and you see three runs at $0.08 each, that's 4x normal. Investigate.

How Baseline Detection Works in Practice

Here's a concrete example. Say your agent summarizes customer feedback:

Run 1: 145 tokens
Run 2: 152 tokens
Run 3: 148 tokens
Run 4: 5,200 tokens ← anomaly
Run 5: 4,800 tokens ← anomaly again
Run 6: 156 tokens (back to normal)

Your p50 baseline: ~150 tokens. Your p95: ~160 tokens.

Runs 4 and 5 are 30–40x your baseline. One spike could be noise. Two in a row is a pattern. If you're running this agent 100 times a day, 30 of those runs suddenly costing 5000 tokens changes your bill overnight.

A baseline system flags this: "Runs 4–5 exceeded p99 + sustained above p95. Investigate: possible retry loop or tool recursion."

You check the logs, find the agent is calling a downstream API that's slow, triggering a retry loop. You add a timeout or a backoff. Crisis averted.

What You Need to Implement This

To build and track baselines, you need:

Per-agent token counts from every run. Not just successes—failed runs and retries matter too, because they're where costs spike.
Time-windowed percentiles. Recompute p50, p95, p99 weekly or daily so your baseline adapts. (If you upgraded your model, the new baseline should reflect that.)
Spike detection tied to your baseline. Runs > p99 + multiple occurrences = alert. Runs > p95 once = log, but don't alert.
Cost overlay. Baseline tokens are useful; baseline cost is what matters to your CFO. Map tokens to USD per model, then alert on cost anomalies too.

Most standard application monitoring tools (DataDog, New Relic, etc.) don't track model-specific token usage out of the box—you're usually exporting logs and computing percentiles yourself. That's labor-intensive and easy to miss.

A simpler approach: Use an observability tool built for LLM agents. The AI Agents Control Tower, for example, auto-computes p50/p95/p99 baselines per agent and alerts when a run spikes above them—no manual percentile math, no custom logging. You define your SLA (expected cost or latency), and the system flags deviations automatically.

The Baseline Mindset

The broader lesson: silent cost spikes are only silent if you have no baseline to compare against. The moment you establish what "normal" looks like for your agent—in tokens, cost, latency, or all three—anomalies become visible.

Before you ship your next AI agent:

Establish a baseline during your pre-prod or first week of live traffic.
Track p50, p95, and p99 token usage per agent.
Set a meaningful deviation threshold (not every spike, only sustained ones).
Alert on cost anomalies, not just execution failures.

Your bill (and your sleep) will thank you.

Have you seen a cost spike on an agent you didn't expect? What caught it—or what didn't?

Yes, We Support LangGraph — Here's Why That Was Never a Separate Integration

Babar Hayat — Mon, 20 Jul 2026 04:43:53 +0000

We never built LangGraph support. It's worked from day one anyway — and until now, we'd never actually said so anywhere.

That's not a typo. Here's the reasoning, and why it matters if you're one of the teams running LangGraph in production (Klarna, Uber, LinkedIn, BlackRock, Cisco, Elastic, and JPMorgan all reportedly do).

LangGraph doesn't have its own way of calling a model

LangGraph models agents as explicit state graphs — nodes, edges, persistent state, checkpoints, human-in-the-loop gates. It's a genuinely good abstraction for production agent orchestration. But when a LangGraph node actually needs to call an LLM, it doesn't invent a new calling convention to do it — it reaches for a standard LangChain chat model and calls .invoke() or .ainvoke() on it, exactly the same interface any other LangChain code uses.

That single fact is the whole story.

What we actually built

Our LangChain integration works at the model-object layer, not the orchestration layer:

opsveritas.langchain("Agent Name") — a callback handler for telemetry (tokens, cost, latency, status).
wrapLangChain() / wrap_langchain() — patches .invoke()/.ainvoke() directly on the model object, for kill-switch enforcement (see our kill switch article for what that actually does).

Neither of those cares what's calling the model. A LangGraph node, a plain LangChain chain, a custom orchestration loop — if it's invoking a wrapped model object, we see it. We didn't write a single line of LangGraph-specific code, because the integration point was never LangGraph. It was always the model.

model = ChatOpenAI(callbacks=[opsveritas.langchain("My Agent")])  # telemetry
opsveritas.wrap_langchain(model, agent_name="My Agent")           # kill-switch enforcement
graph_builder.add_node("call_model", make_node(model))            # LangGraph uses the wrapped model, unmodified

That's the entire integration. The graph never knows the model underneath it is instrumented.

Why we're only saying this now

Because until recently, we hadn't said it anywhere — not the docs, not onboarding, not the dashboard. A prospect evaluating us against a LangGraph-based stack would have had no way to know it already worked. That's a real gap between what the product does and what it visibly claims to do, and it's now closed across our docs, onboarding, and settings pages.

If you're running LangGraph and monitoring silent failures, cost, or kill-switch enforcement, wrap the model before you pass it to the graph. That's it.

The Kill Switch: Auto-Pause Runaway AI Agents Before They Burn Your Budget

Babar Hayat — Mon, 20 Jul 2026 04:43:38 +0000

Every AI agent monitoring tool will tell you when you've gone over budget. Almost none of them will actually stop it from happening again five minutes later.

That's the gap between an alert and a kill switch, and it's the reason we built one.

The problem with alert-only budget monitoring

Here's the typical flow: your agent racks up an unexpected cost spike — a bad prompt loop, a retry storm, a runaway multi-agent chain calling itself in circles. Your monitoring tool notices, sends you an email or a Slack ping, and... that's it. The agent keeps running. It keeps spending. You find out you're over budget at the exact same moment your bill does.

An alert is a notification. It's not a brake pedal.

What the kill switch actually does

The kill switch is a monthly budget threshold, set per organization, with real enforcement:

Cross the threshold, and every agent in the org auto-pauses — not "gets flagged," actually pauses.
For SDK-integrated agents (wrap(), monitor(), and now wrapLangChain()/wrap_langchain()), the enforcement happens before the API call — the wrapped client throws a typed OpsVeritasKilledError instead of ever reaching OpenAI, Anthropic, or Gemini. The cost genuinely stops, not just the notification.
One-click restart from the dashboard, with the same actionable guidance whether you restart one agent or all of them: raise the budget or disable the kill switch first if you want it to actually keep running past the next paid call.
It resets monthly, with rollover — this is a recurring monthly cap, not a permanent lifetime ceiling you cross once and stay crossed forever.
The SDK polls for kill state in the background (every 3 minutes), gated entirely on whether your org has opted in — if you never enable it, your agents generate zero extra background traffic. And if the poll itself fails for any reason (network blip, backend hiccup), it fails open — a transient monitoring issue should never be the thing that breaks your production agent.

The one honest limitation

This only works for agents integrated via the SDK. If your agent only sends telemetry over the generic webhook (the zero-code path for n8n, Make, or any custom script), you still get the alert — but there's no OpsVeritas code running inside your process to actually gate the call, so nothing stops running. We'd rather tell you that plainly than let you assume enforcement exists where it structurally can't.

Why this matters more than it sounds

The real cost of a runaway agent isn't usually the first spike — it's the one that keeps compounding while nobody's looking, because "I'll check the dashboard later" is exactly the gap a kill switch is built to close. An alert assumes a human is watching. A kill switch doesn't need one to be.

Available now on both SDKs (opsveritas-sdk on npm, opsveritas on PyPI). If you're already instrumented with wrap() or langchain(), enabling enforcement is a one-line addition — no re-architecture required.

HTTP 200 Is Not a Product Guarantee

Babar Hayat — Sat, 04 Jul 2026 00:49:16 +0000

HTTP 200 Is Not a Product Guarantee

AI Agents in Production - Series 2, Article 5 of 6

An AI agent ran 47 times last week.

Every run returned HTTP 200. Every run had latency under 2 seconds. No exceptions. No errors in the logs.

And every run produced absolutely nothing.

output_tokens: 0. Forty-seven times in a row.

The infrastructure saw success. The product saw nothing. And no alert fired.

The Failure Class Nobody Monitors

Most teams monitor what the API says. HTTP 200 means the request was accepted, processed, and returned cleanly. That is true. The infrastructure worked perfectly.

But HTTP 200 only tells you about the transport layer. It says nothing about what your agent was supposed to produce.

This is the failure class called silent failure: the system reports success while the business value goes to zero.

It is the most dangerous failure type because monitoring dashboards show green, on-call does not get paged, and the client has no idea for days or weeks.

Three Failure Shapes That Look Like Success

Shape 1 - The Empty Responder

The agent calls the LLM, gets back an empty completion. choices[0].message.content is empty. HTTP 200. output_tokens: 0. Your pipeline continues as if everything worked.

Shape 2 - The Stuck Safety Filter

The model triggers an internal safety classification and returns an empty response rather than a refusal. No error. Just silence.

Shape 3 - The Token Drain

Prompt tokens go up, output tokens stay at zero. You are paying for every prompt, generating nothing. All three return HTTP 200.

What You Actually Need to Monitor

`python
from opsveritas import AgentTracer

tracer = AgentTracer(api_key="your-key")

with tracer.trace("content-generator") as span:
response = client.chat.completions.create(
model="gpt-4o",
messages=messages
)

content = response.choices[0].message.content
output_tokens = response.usage.completion_tokens

span.set_output(
    summary=content[:500] if content else "[EMPTY - silent failure]",
    output_tokens=output_tokens,
    success=bool(content and output_tokens > 0)
)

Success is not http_status == 200. Success is output_tokens > 0 AND the output contains meaningful content.

`javascript
import { AgentTracer } from "@opsveritas/sdk";

const tracer = new AgentTracer({ apiKey: "your-key" });

const span = tracer.start("content-generator");
try {
const response = await openai.chat.completions.create({ ... });
const outputTokens = response.usage.completion_tokens;

span.finish({
outputTokens,
outputSummary: response.choices[0].message.content || "[EMPTY]",
success: outputTokens > 0
});
} catch (err) {
span.error(err);
}
`

The Alert That Should Have Fired

json { "alert_type": "silent_failure", "workflow_name": "content-generator", "message": "output_tokens = 0 on 47 consecutive runs. HTTP 200 returned each time.", "diagnosis": "Likely: safety filter on prompt, empty completion, or context window exhaustion.", "consecutive_empty_runs": 47, "cost_usd_burned": 0.22 }

Note the diagnosis field: AI-generated root cause analysis, not just a raw metric.

What This Changes

When you monitor output_tokens alongside HTTP status, silent failures get caught in the first run, not the 47th. You stop paying for token consumption that produces nothing. Client-facing failures get resolved before the client notices.

HTTP 200 is a transport guarantee. Build your monitoring at the product layer.

Free to try: https://agents.opsveritas.com

We also build these end-to-end: https://opsveritas.com

DM for a 15-min demo if you are running agents in production.

Two Lines of Code to Full Observability

Babar Hayat — Thu, 02 Jul 2026 18:21:49 +0000

Most observability platforms require you to rebuild your infrastructure around them. New logging pipelines. New deployment configs. Dedicated sampling layers. By the time you have "observability," you've also rewritten half your stack.

The AI Agents Control Tower SDK doesn't work that way.

The Two Lines

Python:

from opsveritas import track
result = track(my_agent_function)(input_data)

JavaScript:

import { track } from '@opsveritas/sdk';
const result = await track(myAgentFunction)(inputData);

That's it. Your existing agent call, wrapped. No new infrastructure. No deployment changes. No sidecars, no proxies, no log aggregators.

What Happens From Those Two Lines

From the moment you wrap the call, every execution automatically sends: duration, token usage, USD cost, status (success/failed/empty), output_summary, and timestamp — all flowing into your agents.opsveritas.com dashboard, visible per-execution and per-agent across your entire fleet.

The Custom Webhook Fallback

Already using LangSmith? Running a framework that doesn't wrap cleanly? The platform also accepts raw webhook payloads:

curl -X POST https://agents.opsveritas.com/api/sdk/ingest \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"agent_name":"my-agent","status":"success","duration_ms":1250,"tokens_used":842,"cost_usd":0.0017}'

Same dashboard. Same alerts. Any platform, any language, any orchestration framework.

What You Get Automatically

Once the SDK is connected, the platform detects:

silent_failure — output_tokens = 0 on HTTP 200 (ran successfully, produced nothing)
token_anomaly — usage 3× above baseline
agent_loop — same tool call repeated (runaway retry behavior)
budget_exceeded — total spend over configured threshold
high_cost_spike — single-run cost anomaly
execution_failed — non-200 response from the model
no_activity — agent hasn't run in its expected window

All 7 alert types, active automatically. No configuration beyond the API key.

The Monitoring Gap Is a Friction Problem

Teams that skip observability aren't being reckless. They're making a rational trade-off: the perceived effort of setting up monitoring versus the urgency of shipping the next feature.

Two lines and an API key is the whole setup. That's a trade-off worth making on any sprint.

Free to try → agents.opsveritas.com
We also build AI agents and automations end-to-end → opsveritas.com
DM me for a 15-min walkthrough.

Token Costs That Compound While You Sleep

Babar Hayat — Wed, 01 Jul 2026 11:55:38 +0000

An AI agent ran inside a customer's pipeline for 30 seconds. By the time anyone looked at the logs, it had made 47 API calls, bloated its context window to 128k tokens, and spent $23.40.

The alert arrived the next morning. The bill arrived 30 days later.

This is the cost compounding problem. It's not about one expensive run — it's about not knowing a run was expensive until long after it happened.

Why AI agent costs compound

Three scenarios cause most runaway token spend:

Context bloat. Agents that don't trim their context window accumulate history across turns. Turn 1: 1,200 tokens. Turn 10: 18,000 tokens. Turn 20: 67,000 tokens. Each call costs more than the last, and the agent never tells you.

Retry storms. An agent hits a rate limit or a malformed JSON response and retries. Each retry is a full prompt re-send at full token cost. Without a circuit breaker, the agent retries until it exhausts the budget or times out. We've seen 12 retries in under 2 minutes.

Agent loops. The agent calls a tool, the tool returns output, the agent reinterprets the output and calls the tool again. Same tool, same parameters, slightly different framing. Repeat 30 times. This is the agent_loop failure mode — it produces no useful output and runs up cost in parallel.

What per-execution cost tracking actually looks like

Most platforms give you daily token totals. That's useful for billing. It's useless for debugging.

What you need is per-execution cost in USD — not just input tokens, not just output tokens, real dollars, broken out per run.

In AI Agents Control Tower, every execution row shows:

Total cost in USD (input + output tokens × model rate)
Token breakdown (input / output separately)
Model used (cost rates differ significantly across GPT-4o, Claude Haiku, Gemini Flash)
Execution duration in seconds

When you see a run that cost $0.23 instead of the usual $0.003, you know to look at it. When you see 47 runs in 2 minutes all at $0.50+, you know you have a loop.

Budget alert thresholds

The budget_exceeded alert fires when cumulative spend on an agent crosses a threshold you set. You configure it per agent — not per account, per agent — because a scraper agent might run at $0.10/day while a reasoning agent might run at $2/day, and both are normal.

The threshold is configurable on the agent detail page. When it fires, it routes the same as any other alert: Slack, email, Teams, simultaneously, with the agent name, current spend, and threshold included in the message.

There's also high_cost_spike — a single execution that costs significantly more than that agent's rolling baseline. This catches one-off anomalies before they become sustained runaway spend.

The right mental model

Your AI agent bill is a function of decisions made at design time — context window management, retry logic, loop detection — not just runtime usage. But you can't improve what you can't see.

Per-execution cost tracking is what makes the cost side of AI agents observable. Not a monthly summary. Not a vague "tokens used" counter. A row per execution with a dollar amount attached.

That's what we built into AI Agents Control Tower, and it's free to try at agents.opsveritas.com — 2-line SDK, no new infrastructure.

We also build custom AI agents end-to-end: opsveritas.com

DM me if you want a 15-min walkthrough.