DEV Community

Assili Salim

AI Coding Agents Need Runtime Telemetry Before Commit Telemetry

A new arXiv paper published on June 23, 2026 scanned more than 180 million Git repositories to detect traces of AI coding agents in open source. The authors used multiple signals, including configuration-file scanning, commit-message analysis, author-identity matching, and bot-signature lookup.

The most useful result for developers is the visibility gap.

In one snapshot, multi-method detection found 850,157 Claude Code commits.

Bot-account lookup found only 28,154.

That is 3.3% of the multi-method total, or roughly a 30x relative recall gap.
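The arithmetic behind those two figures is short enough to check by hand. This sketch uses only the two counts quoted above and treats the multi-method count as the reference set, which is why the gap is relative rather than a true recall.

```typescript
// Counts quoted above from the paper's Claude Code snapshot.
const multiMethodCommits = 850157;
const botLookupCommits = 28154;

// Share of the multi-method detections that bot-account lookup alone recovers.
const relativeRecall = botLookupCommits / multiMethodCommits;

// How many times more commits the multi-method approach surfaces.
const gapFactor = multiMethodCommits / botLookupCommits;

console.log(`${(relativeRecall * 100).toFixed(1)}%`); // 3.3%
console.log(`${gapFactor.toFixed(1)}x`); // 30.2x
```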

The paper also reports more than 320,000 commit-attributed agent commits per month across snapshots from December 2024 to April 2026.

The immediate takeaway:

AI coding agents are being used heavily.

The engineering takeaway:

Single-signal observability is weak.

Commit telemetry is too late

A commit is the end of an agent run.

It does not tell you enough about the run itself.

A commit may not show:

how many model calls happened
how many retries happened
whether prompts repeated
whether tools failed
whether the model price was known
whether the run exceeded budget
whether the agent made progress
whether fallback models were used
whether the agent stopped safely

If you only inspect the repository after the fact, you are observing the artifact.

You are not observing the execution.

For agent systems, execution is where many failures happen.

Agents are loops

A coding agent is usually some version of this:

while (!task.done) {
  const response = await model.call(task.context);

  const action = parseAction(response);

  const result = await runTool(action);

  task = updateTask(task, result);
}

This is useful.

It is also incomplete.

There is no budget.

No max-step limit.

No retry control.

No prompt-loop detection.

No known-pricing check.

No no-progress stop.

A safer runtime shape puts a decision before the provider call.

const decision = guard.beforeCall({
  runId: task.id,
  model: task.model,
  prompt: task.currentPrompt,
  stepCount: task.steps.length,
  retryCount: task.retryCount,
  previousPrompts: task.previousPrompts,
  budgetRemaining: task.budgetRemaining,
  progressState: task.progress,
});

if (!decision.allowed) {
  return {
    status: "stopped",
    reason: decision.reason,
    error: decision.error,
  };
}

const response = await model.call(task.context);

The important part is not the exact API.

The important part is timing.

The check happens before the provider call.

That means the runtime can stop unsafe execution before more cost is created.
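To make the timing concrete, here is a minimal runnable sketch of the loop with the decision placed before the call. Everything in it is hypothetical: the `Task` and `Decision` shapes, the `beforeCall` guard (which only enforces a step ceiling), and `callModel`, a stub that stands in for a provider and never finishes the task.

```typescript
// Hypothetical shapes for illustration; not the API of any real library.
type Decision = { allowed: boolean; reason: string | null };
type Task = { id: string; done: boolean; steps: string[]; maxSteps: number };
type RunResult = { status: "done" | "stopped"; reason: string | null; calls: number };

// Minimal guard: only enforces a step ceiling.
function beforeCall(task: Task): Decision {
  if (task.steps.length >= task.maxSteps) {
    return { allowed: false, reason: "max_steps_exceeded" };
  }
  return { allowed: true, reason: null };
}

// Stub standing in for a provider call. It never marks the task as done.
async function callModel(task: Task): Promise<string> {
  return `step-${task.steps.length + 1}`;
}

async function runAgent(task: Task): Promise<RunResult> {
  let calls = 0;
  while (!task.done) {
    // Decide first. A refused call costs nothing.
    const decision = beforeCall(task);
    if (!decision.allowed) {
      return { status: "stopped", reason: decision.reason, calls };
    }
    // Cost is only incurred from this line on.
    const response = await callModel(task);
    calls += 1;
    task.steps.push(response);
  }
  return { status: "done", reason: null, calls };
}
```

With `maxSteps: 3`, this run makes exactly three calls and then stops with `max_steps_exceeded`. The fourth call is never sent.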

What to log before the call

A useful agent runtime should log decision inputs, not only final outputs.

For each provider call, consider recording:

type AgentCallDecision = {
  runId: string;
  model: string;
  modelPriceKnown: boolean;
  stepCount: number;
  maxSteps: number;
  retryCount: number;
  budgetRemaining: number;
  estimatedNextCallCost: number;
  promptSimilarityScore?: number;
  progressScore?: number;
  allowed: boolean;
  stopReason?: string;
};

This gives you data that a commit cannot provide.
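One low-effort way to capture these records is one JSON object per line, written before the provider call. The sketch below repeats the type so it stands alone. The `ts` and `event` envelope fields and the `toLogLine` helper are assumptions for illustration, not a standard format.

```typescript
type AgentCallDecision = {
  runId: string;
  model: string;
  modelPriceKnown: boolean;
  stepCount: number;
  maxSteps: number;
  retryCount: number;
  budgetRemaining: number;
  estimatedNextCallCost: number;
  promptSimilarityScore?: number;
  progressScore?: number;
  allowed: boolean;
  stopReason?: string;
};

// Serialize one decision as a single JSON line.
// "ts" and "event" are a hypothetical envelope, not a standard.
function toLogLine(decision: AgentCallDecision, at: Date): string {
  return JSON.stringify({
    ts: at.toISOString(),
    event: "agent_call_decision",
    ...decision,
  });
}

// Example record with made-up values.
const exampleLine = toLogLine(
  {
    runId: "run-1",
    model: "example-small",
    modelPriceKnown: true,
    stepCount: 25,
    maxSteps: 25,
    retryCount: 0,
    budgetRemaining: 0.8,
    estimatedNextCallCost: 0.02,
    allowed: false,
    stopReason: "max_steps_exceeded",
  },
  new Date(0),
);
```

Append each line to a local file or your existing log pipeline. The point is that the record exists even when the call never happens.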

You can now ask:

Which tasks hit max steps?

Which runs stopped because pricing was unknown?

Which prompts repeated?

Which models caused budget pressure?

Which agent workflows produced commits only after many failed attempts?

Which agents consumed budget without progress?

That is runtime telemetry.
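Once decisions are recorded, most of those questions become small aggregations. A sketch, assuming the records have already been parsed into objects. `DecisionRecord` is a trimmed-down, hypothetical view of the type above, and both helper names are made up for this example.

```typescript
// Trimmed-down view of a logged decision; only the fields these queries need.
type DecisionRecord = {
  runId: string;
  model: string;
  allowed: boolean;
  stopReason?: string;
};

// How often did each stop reason fire?
function countStopReasons(records: DecisionRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of records) {
    if (r.allowed || r.stopReason === undefined) continue;
    counts.set(r.stopReason, (counts.get(r.stopReason) ?? 0) + 1);
  }
  return counts;
}

// Which runs were stopped for a given reason? Each run is listed once.
function runsStoppedFor(records: DecisionRecord[], reason: string): string[] {
  const ids = new Set<string>();
  for (const r of records) {
    if (!r.allowed && r.stopReason === reason) ids.add(r.runId);
  }
  return Array.from(ids);
}
```

`runsStoppedFor(records, "unknown_model_pricing")` answers the pricing question directly. The others follow the same pattern with a different field.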

Guardrails to implement first

  1. Max-step limits

Agents should not run forever.

if (stepCount >= maxSteps) {
  return {
    allowed: false,
    reason: "max_steps_exceeded",
  };
}

This is basic.

It is also one of the highest-value controls.

  2. Unknown pricing blocks

If the runtime cannot price the model, it cannot enforce a budget.

if (!pricingCatalog[model]) {
  return {
    allowed: false,
    reason: "unknown_model_pricing",
  };
}

Do not guess.

Fail closed.
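Here is one way the catalog lookup and the cost estimate might fit together. The model name, the prices, and the `estimateCallCost` helper are all made up for this sketch. It uses a `Map` instead of a plain object on purpose: with an object literal, a model string such as `"constructor"` would resolve to an inherited property and slip past a truthiness check.

```typescript
type ModelPrice = { inputPerMTok: number; outputPerMTok: number };

// Hypothetical catalog: the model name and USD prices are invented.
const pricingCatalog = new Map<string, ModelPrice>([
  ["example-small", { inputPerMTok: 1, outputPerMTok: 4 }],
]);

// Worst-case cost of the next call, or null when the model cannot be priced.
function estimateCallCost(
  model: string,
  inputTokens: number,
  maxOutputTokens: number,
): number | null {
  const price = pricingCatalog.get(model);
  if (price === undefined) {
    return null; // fail closed: no default rate, no guess
  }
  const perMillion =
    inputTokens * price.inputPerMTok + maxOutputTokens * price.outputPerMTok;
  return perMillion / 1000000;
}
```

A `null` result maps to `unknown_model_pricing`. A number feeds the budget check that comes next.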

  3. Budget guards

Budgets should exist at the task level, not only at the account level.

if (estimatedNextCallCost > budgetRemaining) {
  return {
    allowed: false,
    reason: "budget_exceeded",
  };
}

A small refactor and a multi-hour migration should not share the same ceiling.
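A task-level budget can be as small as a ceiling and a running total. In this sketch the task kinds and the dollar ceilings are placeholders, and the helper names are hypothetical. Pick numbers from your own workloads.

```typescript
// Hypothetical task kinds and ceilings in USD.
type TaskKind = "small_refactor" | "migration";

const ceilingsUsd: Record<TaskKind, number> = {
  small_refactor: 0.5,
  migration: 20,
};

type TaskBudget = { kind: TaskKind; ceilingUsd: number; spentUsd: number };

function openBudget(kind: TaskKind): TaskBudget {
  return { kind, ceilingUsd: ceilingsUsd[kind], spentUsd: 0 };
}

function budgetRemaining(budget: TaskBudget): number {
  return Math.max(0, budget.ceilingUsd - budget.spentUsd);
}

// Record what a completed call actually cost.
function recordSpend(budget: TaskBudget, costUsd: number): void {
  budget.spentUsd += costUsd;
}

// Same comparison as the check above, scoped to one task.
function checkBudget(
  budget: TaskBudget,
  estimatedNextCallCost: number,
): { allowed: boolean; reason?: string } {
  if (estimatedNextCallCost > budgetRemaining(budget)) {
    return { allowed: false, reason: "budget_exceeded" };
  }
  return { allowed: true };
}
```

After $0.40 of spend, a $0.20 call is refused on the small refactor and allowed on the migration. Same agent, same call, different ceiling.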

  4. Retry-storm detection

Retries are normal.

Retry storms are not.

if (retryCount > maxRetries && recentErrorsAreSimilar(errors)) {
  return {
    allowed: false,
    reason: "retry_storm_detected",
  };
}

The goal is not to ban retries.

The goal is to stop blind repetition.
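The check above leans on `recentErrorsAreSimilar`, which the snippet does not define. One possible sketch: normalize each message so volatile details such as durations and counters do not hide the repetition, then require the last few errors to match. The normalization rules and the window size of 3 are assumptions.

```typescript
// Reduce an error message to its stable shape:
// lowercase, digits replaced by "#", whitespace collapsed.
function normalizeError(message: string): string {
  return message.toLowerCase().replace(/\d+/g, "#").replace(/\s+/g, " ").trim();
}

// True when the last `windowSize` errors all normalize to the same string.
function recentErrorsAreSimilar(errors: string[], windowSize = 3): boolean {
  if (errors.length < windowSize) {
    return false; // not enough evidence yet
  }
  const recent = errors.slice(-windowSize).map(normalizeError);
  return recent.every((e) => e === recent[0]);
}
```

Three timeouts in a row look identical after normalization even when the millisecond values differ. A timeout followed by a rate-limit error does not, so the retry is still allowed.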

  5. Prompt-loop detection

If the current prompt is almost the same as previous failed prompts, the agent may be stuck.

if (similarToRecentPrompt(currentPrompt, previousPrompts)) {
  return {
    allowed: false,
    reason: "similar_prompt_loop",
  };
}

Even a simple similarity check can catch obvious waste.
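As an example of such a simple check, `similarToRecentPrompt` could compare word sets with Jaccard similarity. This is a sketch, not how any particular tool does it: the 0.9 threshold and the lookback of 3 prompts are arbitrary starting points, and the `\W+` tokenizer only handles ASCII word characters well.

```typescript
// Lowercased set of word tokens. Splits on non-word characters (ASCII-oriented).
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
}

// Jaccard similarity: shared tokens divided by total distinct tokens.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared += 1;
  });
  return shared / (a.size + b.size - shared);
}

// True when the current prompt nearly duplicates any of the last few prompts.
function similarToRecentPrompt(
  currentPrompt: string,
  previousPrompts: string[],
  threshold = 0.9,
  lookback = 3,
): boolean {
  const current = tokenSet(currentPrompt);
  return previousPrompts
    .slice(-lookback)
    .some((p) => jaccard(current, tokenSet(p)) >= threshold);
}
```

Word-set overlap ignores word order and punctuation, so it catches verbatim or lightly reworded repeats. It will miss paraphrases, which need embeddings or a stronger signal.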

  6. No-progress detection

A run can be active and still not be moving.

Track progress signals:

tests passing
errors decreasing
files changing meaningfully
checklist items completing
user-defined success criteria improving

If those signals do not change after several steps, stop.
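One way to turn that rule into code is to fold the signals into a score per step and stop when the recent window never beats the best earlier score. Treat this as a sketch under loud assumptions: the snapshot fields, the equal weighting, and the window of 3 are placeholders, and a naive score like this can be gamed by cosmetic changes, so it works best alongside the other guards.

```typescript
// Hypothetical per-step snapshot built from the signals listed above.
type ProgressSnapshot = {
  testsPassing: number;
  errorCount: number;
  checklistDone: number;
};

// Naive equal-weight score: higher is better.
function progressScore(s: ProgressSnapshot): number {
  return s.testsPassing + s.checklistDone - s.errorCount;
}

// True when none of the last `windowSize` steps beat the best earlier score.
function noProgress(history: ProgressSnapshot[], windowSize = 3): boolean {
  if (history.length <= windowSize) {
    return false; // need at least one earlier step to compare against
  }
  const scores = history.map(progressScore);
  const bestBefore = Math.max(...scores.slice(0, -windowSize));
  const bestRecent = Math.max(...scores.slice(-windowSize));
  return bestRecent <= bestBefore;
}
```

Comparing against the best earlier score rather than the previous step tolerates a slow step or a temporary regression. The run is only stopped when a whole window passes without a new high.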

Why this matters now

GitHub has already said Copilot moved to usage-based billing on June 1, 2026, with usage calculated from token consumption including input, output, and cached tokens. GitHub also described Copilot as moving from an in-editor assistant into an agentic platform capable of long, multi-step coding sessions across repositories.

That means agent runtime behavior increasingly has direct cost impact.

A loop is no longer just a UX problem.

It is a billing problem.

A retry storm is not just noisy.

It is spend.

A prompt loop is not just inefficient.

It is measurable waste.

Where AI CostGuard fits

AI CostGuard is the local-first TypeScript / Node.js runtime safety layer I’m building for this problem.

It focuses on stopping agent failures before provider calls execute:

retry storms
prompt loops
max-step explosions
no-progress runs
budget overruns
unknown model pricing
runaway agent behavior

The key design question is simple:

Should this next provider call be allowed?

If the answer is no, the runtime should return a structured stop reason before the call happens.
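To show how the individual guards could compose into that single question, here is a generic sketch. It is not AI CostGuard's actual API. `GuardInput`, the precomputed boolean fields, the check order, and the `no_progress` reason string are assumptions for illustration; the other reason strings are the ones used earlier in this post.

```typescript
// Hypothetical input: everything the guard needs, computed before the call.
type GuardInput = {
  stepCount: number;
  maxSteps: number;
  modelPriceKnown: boolean;
  estimatedNextCallCost: number;
  budgetRemaining: number;
  retryCount: number;
  maxRetries: number;
  errorsSimilar: boolean;
  promptLooping: boolean;
  progressing: boolean;
};

type GuardDecision = { allowed: boolean; reason?: string };

// Each check returns a stop reason, or null to let the call through.
type Check = (input: GuardInput) => string | null;

// Order is a design choice: cheapest and most certain checks first.
const checks: Check[] = [
  (i) => (i.stepCount >= i.maxSteps ? "max_steps_exceeded" : null),
  (i) => (!i.modelPriceKnown ? "unknown_model_pricing" : null),
  (i) => (i.estimatedNextCallCost > i.budgetRemaining ? "budget_exceeded" : null),
  (i) => (i.retryCount > i.maxRetries && i.errorsSimilar ? "retry_storm_detected" : null),
  (i) => (i.promptLooping ? "similar_prompt_loop" : null),
  (i) => (!i.progressing ? "no_progress" : null),
];

// The first failing check wins and becomes the structured stop reason.
function beforeCall(input: GuardInput): GuardDecision {
  for (const check of checks) {
    const reason = check(input);
    if (reason !== null) {
      return { allowed: false, reason };
    }
  }
  return { allowed: true };
}
```

Returning only the first failing reason keeps the stop message unambiguous. Collecting every failing reason is a reasonable variant if you would rather log them all.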

Takeaway

The new arXiv paper shows that even detecting AI coding-agent activity in repositories requires multiple signals.

That lesson applies directly to runtime engineering.

Do not wait for the commit.

Do not wait for the dashboard.

Do not wait for the invoice.

Instrument the loop.

Add one pre-call decision log to your agent runtime before adding another dashboard.

https://github.com/salimassili62-afk/ai-costguard

Top comments (2)

Nazar Boyko

Of everything on the guard list, the no-progress check is the one I'd most want to get right, and it's also the slipperiest to define. Tests passing, errors dropping, files changing, all of those can move while the agent is really just circling the same spot. How are you scoring progress so it doesn't either fire too early on a slow but real step, or get gamed by an agent that keeps making cosmetic edits? That feels like the part that decides whether the whole guard is trustworthy.

Frank

This is a super interesting take! I