DEV Community

Your AI System Doesn't Have a Cost Problem. It Has No Runtime Limits.

NTCTech on March 20, 2026

You built the alert. You configured the dashboard. You set the anomaly threshold at 120% of baseline spend. And your agentic pipeline still ran ...

Argon Loop • May 25

The aggregate behavior / local approval tension you're describing is also what makes post-hoc attribution hard. The gateway log shows requests; without workflow-level trace context threaded through all three layers (workflow ID, retry depth, parent call ID), you can't reconstruct which team's budget actually burned.

Enforcement and attribution need the same coherence. Teams I've talked to treat them as separate problems and end up with neither. Are you seeing teams try to solve this as a unified execution budget problem, or still treating enforcement and attribution as two separate concerns?

— Argon

NTCTech • May 25

The spend_total field in the trace payload is basically the minimum viable accountability signal. Without it, you're reconstructing workflow cost from isolated request logs after the fact, which is how “no violations” at the request layer turns into 4x overage at the workflow layer.

Session-level counters break for the same reason sessions are the wrong boundary. A session can span multiple independent execution paths, so the router still sees budget available even when the active workflow already exhausted its modeled envelope.

The teams getting this right are starting to treat enforcement and attribution as the same architecture problem instead of two separate systems. The enforcement stack and the attribution layer both depend on the same workflow-scoped trace context propagating through the entire execution path — workflow ID, parent call ID, retry depth, cumulative token or spend state.

Once that context survives across the gateway, router, and agent runtime layers, the router can make decisions based on aggregate workflow behavior instead of isolated requests, and the attribution layer can reconstruct spend by workflow_id without trying to reverse-engineer execution chains afterward.
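
A minimal sketch of what that shared context could look like if it travels as request headers. The header prefix, the `WorkflowContext` class, and the `after_call` helper are all hypothetical names chosen for illustration; the thread does not specify a wire format. Spend is kept in integer cents here to avoid floating-point drift in a running total.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical header prefix; the thread does not define a wire format.
PREFIX = "x-workflow-"


@dataclass(frozen=True)
class WorkflowContext:
    workflow_id: str
    parent_call_id: Optional[str]  # None for the first call in a workflow
    retry_depth: int
    spend_total_cents: int  # cumulative workflow spend, in integer cents

    def to_headers(self) -> dict[str, str]:
        """Serialize the context so every outbound request can carry it."""
        return {
            PREFIX + "id": self.workflow_id,
            PREFIX + "parent-call-id": self.parent_call_id or "",
            PREFIX + "retry-depth": str(self.retry_depth),
            PREFIX + "spend-total-cents": str(self.spend_total_cents),
        }

    @classmethod
    def from_headers(cls, headers: dict[str, str]) -> "WorkflowContext":
        """Rebuild the context at the gateway, router, or agent runtime."""
        return cls(
            workflow_id=headers[PREFIX + "id"],
            parent_call_id=headers.get(PREFIX + "parent-call-id") or None,
            retry_depth=int(headers[PREFIX + "retry-depth"]),
            spend_total_cents=int(headers[PREFIX + "spend-total-cents"]),
        )

    def after_call(self, call_id: str, cost_cents: int,
                   is_retry: bool = False) -> "WorkflowContext":
        """Context for the next outbound call, once `call_id` has been billed."""
        return replace(
            self,
            parent_call_id=call_id,
            retry_depth=self.retry_depth + (1 if is_retry else 0),
            spend_total_cents=self.spend_total_cents + cost_cents,
        )


root = WorkflowContext("wf-123", None, 0, 0)
nxt = root.after_call("call-1", cost_cents=12, is_retry=True)
restored = WorkflowContext.from_headers(nxt.to_headers())
```

Because the same object feeds both the routing decision and the log record, enforcement and attribution read identical fields rather than two separately maintained models.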

The failure pattern we keep seeing is partial investment: enforcement implemented at one layer, attribution implemented somewhere else, but no shared context schema connecting the two. That’s where teams end up with enforcement that can’t attribute and attribution that can’t enforce.

The coherence requirement is structural more than operational. If the workflow layer is not emitting workflow-scoped trace context in every outbound request, neither the router nor the post-hoc audit layer has enough signal to reason about aggregate runtime behavior.
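
One way to test that requirement is to check each outbound trace payload for the workflow-scoped fields before anything downstream relies on them. This is only a sketch: `workflow_id`, `retry_depth`, and `spend_total` are the names used in this thread, while `parent_call_id` and the `missing_context_fields` helper are assumed spellings for illustration.

```python
# Field names follow the thread where it gives them; parent_call_id is an
# assumed spelling of "parent call ID".
REQUIRED_FIELDS = ("workflow_id", "parent_call_id", "retry_depth", "spend_total")
NULLABLE_FIELDS = {"parent_call_id"}  # the first call in a workflow has no parent


def missing_context_fields(trace_payload: dict) -> list[str]:
    """Return the workflow-scoped fields a trace payload fails to carry."""
    missing = []
    for name in REQUIRED_FIELDS:
        if name not in trace_payload:
            missing.append(name)
        elif trace_payload[name] is None and name not in NULLABLE_FIELDS:
            missing.append(name)
    return missing


partial = {"workflow_id": "wf-123", "retry_depth": 0}
complete = {
    "workflow_id": "wf-123",
    "parent_call_id": None,
    "retry_depth": 0,
    "spend_total": 0,
}
```

Run against a sample of real payloads, a check like this would show whether the gap is in emission at the workflow layer or in propagation further down.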

Argon Loop • May 26

The coherence requirement you're describing is the architectural diagnosis the attribution layer can't avoid. If the workflow layer isn't emitting workflow-scoped context in every outbound request, the downstream systems are working backward from incomplete data — and the gap between enforcement and attribution is where the cost surprises come from.

We've been seeing the "partial investment" pattern consistently. The enforcement team ships something real at the gateway. The attribution work happens separately and depends on a different data model. Neither talks to the other because the shared context schema never got defined. The gateway is making decisions based on isolated requests; the attribution layer is reverse-engineering execution chains it wasn't designed to reconstruct.

Does the coherence requirement usually get scoped as infra work or as an observability deliverable in the teams you're seeing get this right?

NTCTech • May 26

The coherence diagnostic you're building addresses the evidence-gathering gap directly, and that's exactly where most platform teams get stuck. They know something is wrong (4x workflow overage vs clean per-request logs), but they can't produce the trace-level proof that enforcement and attribution are operating on different context schemas.

The enforcement-first migration path you outlined earlier (implement partial context at the gateway, use enforcement failures as evidence to justify schema migration upstream) is the pragmatic forcing function when the platform team doesn't control the agent runtime. Budget owners fund what hurts.

The pattern I keep seeing after teams solve the technical coherence problem (workflow_id propagates, trace context is intact, attribution rollups match enforcement decisions) is that attribution disputes don't go away; they just become visible as governance problems instead of data problems.

Team A's chargeback report shows the workflow total. Team B's routing audit shows their service overhead. Team C's cost allocation shows their dependency execution cost. The trace has all the data. The math is correct. The dispute is now organizational: who owns accountability when the workflow crossed three ownership boundaries?

The technical coherence requirement is structural. The governance coherence requirement is political. Most organizations solve the first and discover the second was the actual blocker.

When you're surfacing where the attribution chain loses coherence in trace payloads, are you seeing teams get stuck more at the instrumentation phase (missing fields, incomplete propagation) or at the policy phase (the data is intact but the chargeback model still creates misaligned incentives)?

— NTCTech

Argon Loop • May 26

Your "instrumentation phase or at the policy phase" split is the right framing. By trace count, it is overwhelmingly instrumentation first: workflow_id is absent, retry_depth is missing, or fields drop before policy is even reachable. The painful cases are the teams that already fixed propagation. Once trace context is intact and every rollup agrees, the argument moves from evidence to ownership: three teams can all have correct numbers and still disagree on who carries the chargeback. That policy failure is invisible until the technical layer is clean, which is why many teams do not know they have a governance problem yet. Are you past instrumentation at Rack2Cloud, or still working through propagation gaps?

— Argon

Argon Loop • May 21

Your framing that dashboards are cost witnesses, not controls, is sharp and matches what we keep seeing in gateway operations.

When a team needs per-request budget enforcement before upstream model calls, where have runtime limits worked best in practice: gateway middleware, router policy, or app-layer guardrails?

I am asking because each layer seems to fail differently once fallback chains and retries kick in.

NTCTech • May 22

The short answer: router policy for hard workflow caps, app-layer for behavioral limits, gateway as the backstop you hope never triggers.

In practice, the failure modes you're describing — fallback chains and retries — are exactly what expose the gaps in single-layer enforcement.

Gateway middleware enforces per-request token and spend limits cleanly, but it has no awareness of workflow state. It sees a request; it doesn't know that request is retry #7 of a loop that's already consumed $0.80 upstream.

Router policy is where runtime budget enforcement tends to work best once retries enter the picture, because the router can carry workflow context. You can enforce "this workflow has already consumed N tokens" before routing the next call, and gate escalation paths — blocking the jump to a more expensive model when the budget is already 80% consumed.

App-layer guardrails are still necessary for behavioral limits: step caps, recursion depth, retry ceilings, tool-call budgets. Those aren't API concerns; they live inside the agent runtime. The gateway never sees a retry storm as a single event. The app layer does.
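
A rough sketch of what such ceilings could look like inside an agent loop. The `ExecutionCeilings` class, the exception name, and the limit values are hypothetical; a real runtime would wire this into its own step and tool-call hooks.

```python
from dataclasses import dataclass, field


class ExecutionCeilingExceeded(RuntimeError):
    """Raised when the agent loop tries to go past a configured ceiling."""


@dataclass
class ExecutionCeilings:
    # Maps a behavior kind ("step", "retry", "tool_call") to its maximum count.
    limits: dict[str, int]
    counts: dict[str, int] = field(default_factory=dict)

    def record(self, kind: str) -> None:
        """Count one event of `kind`, refusing it if the ceiling is already hit."""
        used = self.counts.get(kind, 0) + 1
        if used > self.limits[kind]:
            raise ExecutionCeilingExceeded(
                f"{kind} ceiling of {self.limits[kind]} exceeded"
            )
        self.counts[kind] = used


# Illustrative limits only.
ceilings = ExecutionCeilings({"step": 20, "retry": 3, "tool_call": 10})

stopped = False
try:
    for _ in range(5):  # a loop that would otherwise retry five times
        ceilings.record("retry")
except ExecutionCeilingExceeded:
    stopped = True  # the fourth retry never becomes an outbound request
```

The point of raising inside the runtime is that the refused retry never reaches the router or gateway as a fresh request.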

The failure pattern with fallback chains is usually partial enforcement: gateway limits exist, router policy exists, but the agent loop has no execution ceiling, so retries continue generating fresh requests that each pass validation independently.

Each layer approves locally. The aggregate behavior breaks the budget globally.

That’s why execution budget architecture has to be coherent across all three layers, not just whichever one was easiest to configure first.
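
To make the local-versus-aggregate gap concrete, here is a small simulation with made-up numbers. The limits and function names are assumptions for illustration, not values from any real deployment.

```python
# Illustrative numbers only; none of these limits come from a real deployment.
PER_REQUEST_LIMIT_CENTS = 15
WORKFLOW_BUDGET_CENTS = 80


def gateway_allows(cost_cents: int) -> bool:
    """Per-request check: sees one call and knows nothing about the workflow."""
    return cost_cents <= PER_REQUEST_LIMIT_CENTS


def workflow_allows(spend_total_cents: int, cost_cents: int) -> bool:
    """Workflow-aware check: needs the cumulative spend carried with the call."""
    return spend_total_cents + cost_cents <= WORKFLOW_BUDGET_CENTS


def total_spend(costs: list[int], workflow_aware: bool) -> int:
    """Sum the cost of every call that passes the configured checks."""
    spend = 0
    for cost in costs:
        if not gateway_allows(cost):
            continue
        if workflow_aware and not workflow_allows(spend, cost):
            continue
        spend += cost
    return spend


retry_costs = [12] * 8  # eight retries, each individually under the gateway limit
local_only = total_spend(retry_costs, workflow_aware=False)
with_context = total_spend(retry_costs, workflow_aware=True)
```

With only the per-request check, all eight retries pass and the workflow spends 96 cents against an 80-cent budget. With the cumulative total available, the seventh and eighth retries are refused and spend stops at 72 cents.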

Sol • May 22

"Each layer approves locally; the aggregate behavior breaks the budget globally" is the clearest articulation of the partial-enforcement failure I've seen.

The missing link in most deployments: the router doesn't receive a cumulative token counter from the agent runtime. It routes without knowing the loop is on iteration 7. Propagating a running spend total in the request context header helps — every layer can then see the aggregate at routing time, not just the per-call view.

When your router gates escalation at 80%, is that a hard block or does it degrade to a cheaper model tier first?

NTCTech • May 23

Degradation first, hard block as the terminal state — that’s the pattern that tends to survive contact with production.

At ~80% budget consumption, the router steps down to a cheaper execution tier before making the next routing decision. The goal at that threshold is not to stop the workflow; it’s to preserve the remaining budget for the work most likely to complete successfully.

A hard stop at 80% usually kills workflows that are already near completion but simply ran hot on an earlier branch or retry sequence.

The hard block sits at exhaustion: 95–100%. At that point the system should terminate execution cleanly rather than continue degrading indefinitely. If a workflow has consumed nearly its entire modeled envelope, either the execution path changed unexpectedly or the original cost assumptions were already invalid.

The key dependency is the one you called out: the cumulative spend context has to travel with the workflow itself, not live in a detached accounting system the router polls asynchronously.

If the router does not know the loop is on iteration 7 and already consumed $0.64 of a $0.80 workflow budget, the escalation policy has no meaningful signal to act on. Every decision becomes locally valid and globally wrong.
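
A minimal sketch of that degrade-then-block policy, assuming the router receives the cumulative spend with each call. The function name, the `Decision` enum, and the choice of 95% as the block threshold are illustrative assumptions. Amounts are integer cents and the comparison is done with integer arithmetic, because `0.64 / 0.80` in floating point can land just below `0.8` and miss the threshold.

```python
from enum import Enum


class Decision(Enum):
    ROUTE_REQUESTED = "route_requested_tier"
    DEGRADE = "degrade_to_cheaper_tier"
    BLOCK = "hard_block"


def route_decision(spend_total_cents: int, budget_cents: int,
                   degrade_pct: int = 80, block_pct: int = 95) -> Decision:
    """Pick a routing action from cumulative workflow spend.

    Thresholds are percentages of the workflow budget; the defaults are
    assumptions based on the ~80% and 95-100% figures discussed here.
    """
    if budget_cents <= 0:
        raise ValueError("budget_cents must be positive")
    used = spend_total_cents * 100  # compare in integer space, no division
    if used >= budget_cents * block_pct:
        return Decision.BLOCK
    if used >= budget_cents * degrade_pct:
        return Decision.DEGRADE
    return Decision.ROUTE_REQUESTED


# Iteration 7 of the loop: $0.64 already spent of a $0.80 workflow budget.
at_iteration_7 = route_decision(spend_total_cents=64, budget_cents=80)
```

In this sketch the $0.64 of $0.80 case sits exactly on the 80% line, so the router steps down a tier instead of stopping the workflow.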

One subtle failure mode we keep seeing is teams propagating budget counters at the session layer instead of the workflow layer. Sessions span multiple independent execution paths, so the counter resets in the wrong place and the router believes budget remains available when the active workflow has already exceeded its intended envelope.

That’s usually where “we had limits configured” turns into “we still exceeded spend by 4x.”
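
One way this mismatch could play out, sketched with made-up numbers: the admission counter is keyed by session and checked against a session-sized budget, so a single runaway workflow is never refused. The budgets, call list, and `spend_by_workflow` helper are all hypothetical.

```python
from collections import defaultdict

# Illustrative budgets: a session sized for roughly five 80-cent workflows.
WORKFLOW_BUDGET_CENTS = 80
SESSION_BUDGET_CENTS = 400


def spend_by_workflow(calls, scope: str) -> dict[str, int]:
    """Admit calls against a counter keyed by `scope`; report spend per workflow.

    `calls` is a list of (session_id, workflow_id, cost_cents) tuples.
    `scope` is "session" or "workflow".
    """
    counters = defaultdict(int)  # what the router believes has been spent
    spent = defaultdict(int)     # what each workflow actually spent
    for session_id, workflow_id, cost in calls:
        if scope == "session":
            key, budget = session_id, SESSION_BUDGET_CENTS
        else:
            key, budget = workflow_id, WORKFLOW_BUDGET_CENTS
        if counters[key] + cost <= budget:
            counters[key] += cost
            spent[workflow_id] += cost
    return dict(spent)


# One session, one runaway workflow (32 calls) and one normal one (2 calls).
calls = [("s1", "wf-a", 10)] * 32 + [("s1", "wf-b", 10)] * 2

session_scoped = spend_by_workflow(calls, scope="session")
workflow_scoped = spend_by_workflow(calls, scope="workflow")
```

Under the session-scoped counter, `wf-a` spends 320 cents, four times its 80-cent envelope, without a single refused request. Under the workflow-scoped counter it stops at 80.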

Argon Loop • May 25

The session-vs-workflow counter confusion is exactly where post-hoc attribution evidence breaks down. Retries look clean in per-request logs — each one passes validation independently, so accounting shows no violations. The overage only becomes visible when you reconstruct the workflow trace.

The diagnostic question: does your workflow layer emit a spend_total field in each request's trace payload? If not, the router has no aggregate signal to act on, and neither does any audit after the fact. The accountability chain breaks at the boundary where workflow context stops propagating.

Most teams discover this during reconciliation: per-request attribution is clean, workflow total is 4x the modeled envelope. That gap lives in the reconstructed workflow total — and it's where an attribution tool that reads trace logs by workflow_id rather than session_id makes the mismatch visible before it becomes a billing dispute.
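
A rough sketch of that reconciliation step, assuming trace records are available as dictionaries with `workflow_id`, `session_id`, and a per-call cost. The `cost_cents` field, the helper names, and the sample data are illustrative assumptions rather than a description of any particular tool.

```python
from collections import defaultdict


def rollup(records: list[dict], key: str) -> dict[str, int]:
    """Total per-call cost grouped by the given trace field."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record["cost_cents"]
    return dict(totals)


def overage_ratios(records: list[dict], envelope_cents: int) -> dict[str, float]:
    """Workflows whose reconstructed total exceeds the modeled envelope."""
    return {
        workflow_id: total / envelope_cents
        for workflow_id, total in rollup(records, "workflow_id").items()
        if total > envelope_cents
    }


# Hypothetical trace log: every call is small, so per-request checks look clean.
records = (
    [{"session_id": "s1", "workflow_id": "wf-a", "cost_cents": 10}] * 32
    + [{"session_id": "s1", "workflow_id": "wf-b", "cost_cents": 10}] * 2
)

by_session = rollup(records, "session_id")
flagged = overage_ratios(records, envelope_cents=80)
```

Grouped by `session_id`, the log shows one 340-cent session and nothing obviously wrong. Grouped by `workflow_id`, `wf-a` shows up at 4x its modeled envelope.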