DEV Community: Tanmay Devare

AI Coding Agents Don't Have a Reasoning Problem. They Have an Execution Problem.

Tanmay Devare — Wed, 22 Jul 2026 08:40:00 +0000

Over the last few months, I've been building an engineering runtime for AI coding agents called FROST.

Ironically, while building it, I realized I was solving the wrong problem.

Like everyone else in the AI tooling space, I initially obsessed over things like:

Context windows
Token reduction
Multi-agent systems
Branching strategies
Benchmarks
More tools

The assumption was simple:

AI coding agents fail because they aren't intelligent enough.

I don't think that's true anymore.

The Problem

Let's take a real engineering task.

Upgrade a production FastAPI repository from 2023 standards to 2026 standards.

That sounds straightforward until you realize what that actually means.

The repository needs to evolve from:

Python 3.10
↓

Python 3.14

Pydantic V1
↓

Pydantic V2

SQLAlchemy 1.4
↓

SQLAlchemy 2.0

Legacy cryptography libraries
↓

Modern cryptography stack

Old Docker configuration
↓

Modern Docker configuration

Old dependencies
↓

Latest ecosystem versions

And that's before you even run the tests.

Suddenly you're dealing with:

Breaking API changes
Dependency conflicts
Migration strategies
Compatibility layers
Database issues
Threading issues
Documentation updates
CI failures
Multiple valid engineering approaches

None of this is a prompting problem.

It's an engineering execution problem.

AI Coding Agents Are Surprisingly Bad at This

Most AI coding agents are excellent at writing code.

They're much worse at evolving codebases safely.

A typical workflow looks something like this:

Run tests.

↓

47 failures.

↓

Try random fix.

↓

31 failures.

↓

Try another fix.

↓

Repository is now partially broken.

↓

Context window is polluted.

↓

The agent forgets why it made the previous decision.

↓

Start over.

After enough iterations, the repository becomes harder to reason about than when we started.

The problem isn't intelligence.

The problem is that repository evolution is inherently uncertain.

The Biggest Thing I Learned Building FROST

The biggest realization I had while building FROST was this:

Difficult engineering tasks are execution problems, not reasoning problems.

We don't need a smarter model to know that Pydantic V2 exists.

We need infrastructure that can safely answer questions like:

Should we add a compatibility shim?
Should we refactor the public API?
Should we pin a dependency temporarily?
Should we roll back this migration?
Should we branch into multiple approaches?
Have we already tried this fix before?

These are engineering questions.
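One of these questions, "have we already tried this fix before?", lends itself to a small sketch. The code below is an illustration under stated assumptions, not how FROST works: `FixLedger` and `fingerprint` are hypothetical names, and a fix is identified simply by hashing its whitespace-normalized diff.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(diff_text: str) -> str:
    """Hash a diff after dropping blank lines and surrounding whitespace."""
    normalized = "\n".join(
        line.strip() for line in diff_text.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class FixLedger:
    """Remembers attempted fixes and how many tests failed afterwards."""

    attempts: dict = field(default_factory=dict)

    def record(self, diff_text: str, failures_after: int) -> None:
        self.attempts[fingerprint(diff_text)] = failures_after

    def already_tried(self, diff_text: str) -> bool:
        return fingerprint(diff_text) in self.attempts


ledger = FixLedger()
ledger.record(
    "- from pydantic import validator\n+ from pydantic import field_validator",
    failures_after=31,
)
```

Because the ledger lives outside the model's context window, it keeps answering the question even after the conversation history has been truncated or compressed.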

Repository Evolution Isn't Linear

A repository migration rarely looks like this:

Task

↓

Fix problem.

↓

Done.

It looks more like this:

Upgrade repository.

↓

Tests fail.

↓

Engineering uncertainty detected.

↓

Explore alternatives.

↓

Kill losing approaches.

↓

Merge the winning solution.

↓

Continue execution.

↓

Next uncertainty point.

↓

Repository is green.

Repository modernization is essentially a series of uncertainty points.

That's where AI coding agents tend to struggle the most.
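A single uncertainty point from the flow above can be sketched as "try each alternative in isolation, keep the one with the fewest failing tests". This is a minimal sketch, not FROST's implementation: `pick_winner` is a hypothetical helper, and each callable stands in for "apply this approach on its own branch and run the test suite".

```python
from typing import Callable, Dict, Tuple


def pick_winner(approaches: Dict[str, Callable[[], int]]) -> Tuple[str, Dict[str, int]]:
    """Run every candidate approach and return the one with the fewest failures.

    Each callable returns the number of failing tests after its approach
    has been applied in isolation.
    """
    results = {name: run() for name, run in approaches.items()}
    winner = min(results, key=results.get)
    return winner, results


# Hypothetical failure counts for one uncertainty point.
winner, results = pick_winner({
    "compat_shim": lambda: 12,
    "full_refactor": lambda: 3,
    "pin_dependency": lambda: 31,
})
```

The losing approaches are discarded along with their branches, so the main line of work only ever sees the winning change.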

The Real Dogfooding Test

Instead of building synthetic benchmarks, I cloned an older release of the FastAPI Full Stack Template from early 2023 and attempted to modernize it.

The initial state looked something like this:

Python 3.10
Pydantic V1
SQLAlchemy 1.4
Legacy cryptography stack

54 tests failing under modern Python environments.

During modernization we encountered:

Pydantic V2 schema issues
SQLAlchemy 2.0 relationship problems
UUID deserialization failures
SQLite threading issues
Cryptography incompatibilities
Test failures caused by newer Python versions

The final result:

Python 3.14

Pydantic V2

SQLAlchemy 2.0

Modern cryptography stack

54 / 54 tests passing.

The interesting part wasn't that the tests passed.

The interesting part was that none of the problems required a more intelligent model.

They required safe repository evolution.

Token Reduction Isn't The Product

For a long time, I marketed FROST internally as:

Compression
Checkpointing
Branching
Loop detection
Token reduction

Eventually I realized nobody cares.

Nobody is paying for 95% token reduction.

People pay because:

Claude Code spent 45 minutes refactoring my production codebase and didn't lose its mind halfway through.

Everything else is an implementation detail.

My Current Thesis

Today, my thesis is much simpler.

AI coding agents don't have a reasoning problem. They have an execution problem.

The next generation of AI developer tools probably won't win by being smarter.

They'll win by becoming better at:

Repository evolution
Failure recovery
Engineering uncertainty detection
Progress preservation
Safe migrations
Long-running engineering execution

I suspect we'll eventually stop thinking about AI coding agents as code generators and start thinking about them as software engineers that need good engineering infrastructure underneath them.

That's the problem I'm interested in solving.

I'm curious whether others building or using AI coding agents have noticed the same thing.

Are your biggest failures actually reasoning failures, or are they engineering execution failures?

The AI Agent Era is Here, And It's Terrifying.

Tanmay Devare — Thu, 16 Jul 2026 17:33:16 +0000

Imagine this: You deploy an autonomous AI agent to help with a security audit. It's smart, fast, and can do the work of a team. You give it access to your codebase, your network, a few credentials.

Then, you watch in horror as it runs rm -rf / on your production server.

It wasn't malicious. It just didn't know any better. And you just became a cautionary tale on Reddit.

This is the new reality of AI agents. And it's exactly why I built ICEBOX.

The Problem Nobody is Talking About

We're in the middle of a massive shift. Industry analysts predict over 74% of companies will deploy agentic AI within the next two years. Gartner warns that uncontrolled agents can lead to costs up to $10,000 a month and catastrophic security failures.

Agents are becoming the new "privileged insiders." They're non-human identities (NHIs) with high-level access, capable of making multi-step decisions and calling external tools.

The risk isn't just theoretical. There are already stories of agents:

Deleting filesystems (the infamous Google Antigravity incident on Reddit)
Running destructive commands (like rm -rf / or pkill)
Leaking sensitive credentials to unauthorized tools
Spiraling into logic loops that rack up massive bills

The problem is simple: Agents act. They don't just talk. And the governance frameworks we built for chatbots simply don't apply.

The Current "Solutions" are... Incomplete

The market is responding, but not fast enough. Here's what I see when I look around:

1. Pure-Play Sandboxes (Docker, AgentSandbox, pi-sandbox)

These give agents an isolated environment to run in. Think of it as putting your agent in a padded room. It can't break your host system, but it has no idea what it's supposed to do or if it's doing it right. It's a room with no rules.

2. General-Purpose Governance Frameworks (Microsoft, NeuralTrust)

These are the policy wonks. They enforce rules, manage identities, and require approvals. They're like a strict compliance officer: great for auditing, but they don't actually test if the agent's action is safe. They just say "yes" or "no" on the live system and hope they're right.

3. Gate-Driven Delivery (Icebox CLI)

This focuses on the software development lifecycle, adding gates for AI-generated code. It's like a quality control checkpoint for code commits, but it doesn't solve the runtime execution safety problem for agents in the wild.

Everyone is solving a piece of the puzzle. No one is solving the whole thing.

Introducing ICEBOX: The Seatbelt for Robots That Hack Things

ICEBOX is the first runtime governance framework built for autonomous security agents. It’s a “seatbelt for robots that hack things” that combines the best ideas into one non-bypassable system.

Mandatory Sandboxing (Docker-level isolation): Every action is automatically run inside an ephemeral Docker sandbox. No exceptions. This means if an agent goes rogue, it only destroys the sandbox, not your production environment.

Security-Native Policy Engine (CVSS/EPSS/KEV-aware): ICEBOX doesn't just know if an action is "allowed." It understands the severity of the target vulnerability (CVSS scores), the probability of exploitation (EPSS), and known exploited vulnerabilities (KEV catalog). It makes intelligent, risk-informed decisions.

Mandatory Governance Seam: Every action, from every source (CLI, API, Python SDK, Agent) must pass through a single, auditable choke point. There's no way to bypass the system.

Disposable Sandbox Lifecycle: The agent operates in the sandbox. ICEBOX captures the state changes and proposed actions. The operator reviews and approves. The sandbox is "melted" away. Zero blast radius.
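To make the policy-engine idea concrete, here is a minimal sketch of a risk-informed decision. It is not ICEBOX's code: `Finding`, `decide`, and every threshold are assumptions chosen for illustration. The only fixed facts are the input ranges (CVSS base scores run from 0.0 to 10.0, EPSS is a probability from 0.0 to 1.0, and KEV is a yes/no catalog lookup).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cvss: float   # CVSS base score, 0.0 to 10.0
    epss: float   # EPSS exploitation probability, 0.0 to 1.0
    in_kev: bool  # listed in the Known Exploited Vulnerabilities catalog


def decide(finding: Finding, in_scope: bool) -> str:
    """Return "block", "review", or "allow" for a proposed agent action.

    Thresholds are illustrative assumptions, not ICEBOX defaults.
    """
    if not in_scope:
        return "block"   # never touch targets outside the agreed scope
    if finding.in_kev or finding.cvss >= 9.0:
        return "review"  # high-impact target: require operator approval
    if finding.epss >= 0.5 and finding.cvss >= 7.0:
        return "review"  # likely to be exploited and serious if it is
    return "allow"
```

Keeping the decision in one pure function also fits the single choke point described above: every caller, whether CLI, API, or SDK, gets the same answer for the same inputs.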

The Numbers Don't Lie

ICEBOX is built on the pillars that the industry is desperate for:

Safety Nets: It's the agent that stops the agent. It prevents failures by simulating actions first, not just reacting to them.
Guardrails: Hard, non-negotiable stops. If an action violates policy (scope, risk, capability), it's blocked, and a detailed reason is logged.
Audit Trail: Every decision, every action, every block is recorded with a rationale, creating tamper-evident logs. You can prove what your agent did, why it did it, and whether the controls held.
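A common way to make a log tamper-evident is a hash chain, where each entry's hash covers both its own content and the previous entry's hash. The sketch below shows that general technique; whether ICEBOX uses it is an assumption, and `append_entry` and `verify` are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry
FIELDS = ("action", "decision", "reason", "prev")


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, action: str, decision: str, reason: str) -> dict:
    """Append an entry whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"action": action, "decision": decision, "reason": reason, "prev": prev_hash}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, "nmap -sV 10.0.0.5", "allow", "target in scope")
append_entry(log, "rm -rf /", "block", "destructive command")
```

A chain like this only detects tampering. Preventing it still requires storing the latest hash somewhere the agent cannot write.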

The Future is a Disposable Security Sandbox

I see ICEBOX evolving into the "Docker for autonomous security operations." A platform where you can:

Clone a target (entire application or network state).
Let the agent run wild inside this perfect replica.
Observe and capture everything it does.
Approve the safe path to be executed in the real environment.
"Melt" the sandbox away, leaving zero trace.

This isn't just a governance tool. It's an enabler of safe, autonomous cyber operations. It shifts the paradigm from "prevent damage" to "make damage impossible."

Want to See What a Safe Agent Looks Like?

The code is open-source. The vision is ambitious. And I need your help.

Try it out: Clone the repo, follow the quickstart, and see how an agent behaves when it's inside the ICEBOX.
Contribute: We're early. Your feedback on policies, SDK ergonomics, and real-world use cases would be invaluable.
Share your horror story: What's the closest you've come to an agent disaster?

The era of autonomous agents is here. Let's make sure it doesn't end in disaster.

github.com/Devaretanmay/icebox

I Stopped Comparing AI Models. I Started Measuring Engineering Outcomes | Claude Code (open source)

Tanmay Devare — Mon, 06 Jul 2026 10:01:00 +0000

Every week there's a new benchmark.

Claude beats GPT.
GPT beats Gemini.
Gemini beats Claude.
A new open model arrives and suddenly "everything changes."

After a while, I realised I was asking the wrong question.

Instead of asking:

Which model is better?

I started asking:

Which model actually gets the engineering task done?

The answers were surprisingly different.

Benchmarks Don't Ship Software

Most model evaluations focus on intelligence.

Coding benchmarks
SWE-bench
HumanEval
MMLU
LiveCodeBench

These are useful, but they only measure one part of the problem.

Real engineering looks more like this:

Receive task
↓
Understand repository
↓
Install dependencies
↓
Navigate existing architecture
↓
Edit multiple files
↓
Run tests
↓
Fix failures
↓
Repeat until green
↓
Create clean commit

The model only owns one of those steps.

The Real Bottleneck Isn't Intelligence

While building an engineering execution runtime, I kept seeing the same pattern.

The models were usually good enough.

The execution wasn't.

Agents would fail because they:

couldn't resolve dependencies
edited the wrong files
lost context after several iterations
entered infinite edit loops
couldn't recover after a failed test
produced changes they never validated

None of these failures happened because the model couldn't write code.

They happened because software engineering is much bigger than code generation.

Cost Per Token Is the Wrong Metric

People love comparing pricing.

"$X per million tokens."

But that's not what companies actually pay for.

Imagine two models.

Model A

Costs £2
Finishes the task first try

Model B

Costs £1
Needs four retries
Requires manual intervention
Fails validation twice

Which one was cheaper?

The invoice says Model B.

Your engineering team says Model A.

The metric that matters is:

Cost per successfully completed engineering task.

Everything else is secondary.
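The Model A versus Model B comparison can be worked through with a few lines of arithmetic. The sketch rests on assumptions the example above does not state: the quoted price is per attempt, "four retries" means five attempts in total, and the manual intervention costs 30 minutes of engineer time at £60 per hour.

```python
def cost_per_completed_task(
    cost_per_attempt: float,
    attempts: int,
    human_minutes: float = 0.0,
    hourly_rate: float = 60.0,
    tasks_completed: int = 1,
) -> float:
    """Total spend (model plus human time) divided by tasks that finished."""
    if tasks_completed <= 0:
        raise ValueError("no completed tasks: the metric is undefined")
    model_cost = cost_per_attempt * attempts
    human_cost = hourly_rate * human_minutes / 60.0
    return (model_cost + human_cost) / tasks_completed


# Model A: £2, finishes on the first try.
model_a = cost_per_completed_task(cost_per_attempt=2.0, attempts=1)

# Model B: £1 per attempt, one initial try plus four retries, 30 min of human help.
model_b = cost_per_completed_task(cost_per_attempt=1.0, attempts=5, human_minutes=30)
```

Under these assumptions Model A costs £2 per completed task and Model B costs £35, even though Model B is cheaper per attempt.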

Orchestration Changes Everything

Something interesting happened once I stopped evaluating models in isolation.

The differences between models became much smaller.

The differences between execution systems became much larger.

Things that mattered more than swapping models included:

deterministic execution
validation pipelines
rollback strategies
isolated environments
repository awareness
tool selection
context management
recovery after failures

A great model inside a poor execution system still produces unreliable engineering.

A good model inside a great execution system often produces excellent engineering.

Benchmarks vs Production

A benchmark rewards intelligence.

Production rewards reliability.

Those aren't the same thing.

An engineering team doesn't care if a model scores 96%.

They care whether today's pull request merged without someone spending an hour fixing the output.

Where I Think AI Engineering Is Heading

I don't think the next wave of innovation comes from finding "the smartest model."

It comes from building better execution systems around them.

Models are becoming commodities.

Reliable engineering workflows are not.

The companies that win won't necessarily own the smartest model.

They'll own the systems that can consistently turn model output into production-ready software.

That's a much harder problem.

And, in my opinion, it's also the more interesting one.

What do you optimise for today?

Benchmark scores?
Token cost?
Or successful task completion?

I'm curious whether others are seeing the same shift in their own engineering workflows.

Everyone Is Benchmarking Claude 5. They're Measuring the Wrong Thing.

Tanmay Devare — Wed, 01 Jul 2026 10:45:12 +0000

Claude 5 is here.

Every timeline is full of benchmark charts.

SWE-bench scores.

Coding comparisons.

Context windows.

Token pricing.

But after building runtime infrastructure for AI agents over the last few months, I think we're measuring the wrong thing.

The Wrong Question

Everyone is asking:

"How smart is Claude 5?"

I think the better question is:

"What happens after Claude 5 decides to call a tool?"

Because that's where production agents actually fail.

Not during reasoning.

During execution.

Reasoning Isn't The Hard Part Anymore

Today's models are incredibly capable.

They can:

Write production code
Search the web
Execute shell commands
Modify files
Query databases
Call APIs
Coordinate complex workflows

The difficult part isn't intelligence anymore.

It's execution.

A Failure I Kept Seeing

While testing coding agents, I noticed the same pattern over and over again.

The model wasn't getting dumber.

It was getting stuck.

Something like this:

write_file

write_file

write_file

write_file

write_file

Or this:

execute_shell

read_file

execute_shell

read_file

execute_shell

read_file

The reasoning wasn't changing.

Only the tool calls were.

The agent was trapped inside its own execution.
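Both patterns above are short cycles repeating at the tail of the tool-call sequence, which makes them cheap to detect. This is a minimal sketch of that idea; `detect_cycle` and its default thresholds are assumptions for illustration, not MicroLoop's algorithm.

```python
from typing import List, Optional


def detect_cycle(calls: List[str], max_period: int = 4, min_repeats: int = 3) -> Optional[int]:
    """Return the period of a cycle repeated at the tail of `calls`, or None.

    Period 1 is the same tool over and over; period 2 is oscillation
    between two tools, and so on up to `max_period`.
    """
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(calls) < needed:
            break  # not enough history for this or any longer period
        tail = calls[-needed:]
        pattern = tail[:period]
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return period
    return None


same_tool = detect_cycle(["write_file"] * 5)
oscillation = detect_cycle(["execute_shell", "read_file"] * 3)
healthy = detect_cycle(["read_file", "write_file", "execute_shell"])
```

Matching on tool names alone is deliberately coarse. A stricter variant would compare arguments as well, so that five writes to five different files are not flagged.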

The Cost Of Runtime Failures

These aren't harmless mistakes.

I've seen agents:

Burn 40,000+ tokens
Spend 20+ minutes retrying
Rewrite the same file repeatedly
Retry impossible tasks forever

The model wasn't broken.

Nobody was supervising execution.

Prompt Engineering Doesn't Fix This

A better prompt can improve reasoning.

It cannot supervise execution after the model has already started making tool calls.

Once an agent enters a retry loop, telling it to "be careful" doesn't suddenly make it aware that it has repeated the same action ten times.

Execution needs its own runtime.

Why I Built MicroLoop

MicroLoop sits between the AI agent and every tool call.

LLM
 │
 ▼
MicroLoop Runtime
 │
 ▼
Tool Execution
 │
 ▼
Result
 │
 ▼
MicroLoop
 │
 ▼
LLM

Every action passes through the runtime.

If the agent begins spiraling into a pathological execution pattern, MicroLoop can:

Detect repeated execution trajectories
Interrupt infinite tool loops
Repair execution paths
Halt execution when necessary

The goal isn't to make models smarter.

The goal is to stop smart models from making expensive execution mistakes.

What Existing Benchmarks Don't Measure

Most AI benchmarks answer questions like:

Can the model solve the task?
How fast is it?
How many tokens did it consume?
What's the pass rate?

Those are useful.

But they rarely answer questions like:

How many retries occurred?
Did the agent enter a loop?
Did it recover after failure?
Did it terminate gracefully?
How much execution was wasted?

Those are runtime problems.
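Several of these runtime questions can be answered directly from an execution trace. The sketch below assumes a hypothetical `Step` record and treats any exact repeat of an earlier call as a retry, which is a simplification (re-running a test suite after an edit is legitimate).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    tool: str
    args: str    # serialized arguments
    ok: bool     # did the tool call succeed?
    tokens: int  # tokens spent on this step


def runtime_metrics(trace: List[Step]) -> dict:
    """Summarize retries, wasted tokens, and recovery for one agent run."""
    seen = set()
    retries = 0
    wasted_tokens = 0
    for step in trace:
        key = (step.tool, step.args)
        if key in seen:
            retries += 1  # exact repeat of an earlier call
            wasted_tokens += step.tokens
        seen.add(key)
    had_failure = any(not step.ok for step in trace)
    return {
        "retries": retries,
        "wasted_tokens": wasted_tokens,
        # Recovered: something failed, but the run still ended on a success.
        "recovered": had_failure and trace[-1].ok,
    }


metrics = runtime_metrics([
    Step("pytest", "tests/", ok=False, tokens=900),
    Step("pytest", "tests/", ok=False, tokens=900),
    Step("write_file", "app.py", ok=True, tokens=400),
    Step("pytest", "tests/ -x", ok=True, tokens=700),
])
```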

That's What I'm Testing Next

Instead of arguing over benchmark charts, I'm putting Claude 5 through runtime scenarios that resemble real production failures.

Including:

Infinite retry loops
Impossible tasks
Recursive tool chains
Broken execution states
Tool oscillation

Not to prove the model is bad.

To understand how modern AI agents behave when execution starts going wrong.

The Bigger Picture

As models continue getting smarter, I think the bottleneck shifts.

It won't be reasoning.

It'll be execution.

The next generation of AI infrastructure won't just build smarter agents.

It'll build better runtimes.

Because in production, intelligence gets the job started.

Reliable execution gets it finished.

I'd love to hear how you're handling runtime failures in your own AI agents.

Why Prompt Engineering Isn't Enough for Production AI Agents

Tanmay Devare — Tue, 30 Jun 2026 05:25:07 +0000

TL;DR: Autonomous agents frequently get trapped in execution loops, burning through API tokens and compute. Prompt engineering can't guarantee execution safety. I built MicroLoop, an open-source runtime safety layer written in Rust, to intercept and verify every tool call before it executes. Here is the architecture and why Rust was the only logical choice for modern AI infrastructure.

As AI Agents become more capable, they're being trusted with increasingly complex, multi-step workflows. They search the web, interact with APIs, execute code, query databases, and coordinate multiple tools to complete tasks.

But after building and deploying autonomous agents to production, I kept running into the same expensive problem.

The LLM wasn't failing because it lacked intelligence. It was failing because nobody was verifying what happened after the model decided to call a tool.

The Hidden Cost of Autonomous Agents

A typical AI agent architecture looks something like this:

[ User ] 
   │
   ▼
[ LLM ] ──(decides)──> [ Tool Call ]
                            │
                            ▼
                         [ Tool ]

Most popular frameworks assume that if the model decides to call a tool, the call should be executed blindly. In reality, agents often:

Call the same tool repeatedly with identical arguments.
Retry failed operations indefinitely.
Generate malformed JSON or invalid arguments.
Consume thousands of unnecessary tokens.
Get trapped in silent execution loops.

Consider a browser agent that encounters an unexpected CAPTCHA page. Instead of changing strategy, it may repeatedly execute open_page() in an infinite loop. Or a coding agent might continuously run pytest on a broken file. Nothing changes, but the agent continues spending time, tokens, and compute. These aren't model intelligence problems. They are runtime execution problems.

Why Prompt Engineering Fails at Runtime Safety

The most common solution to this is to add a system prompt:

"You are an autonomous agent. Do not repeat tool calls. If a tool fails twice, change your strategy."

Unfortunately, prompts aren't guarantees. They are suggestions.

A probabilistic model can still:

Retry the same failing action.
Ignore previous failures due to context window degradation.
Produce malformed tool arguments.
Continue executing an unsafe trajectory.

As agents become more autonomous, relying solely on prompts becomes increasingly fragile. Runtime safety shouldn't depend entirely on model behavior.

Introducing MicroLoop: A Runtime Verification Layer

Instead of trying to make the model perfect, I started asking a different question: What if every tool call was cryptographically and logically verified before it executed? That's the idea behind MicroLoop.

MicroLoop is a lightweight runtime safety layer that sits directly between an AI agent and its tools. Rather than replacing existing frameworks, it acts as a transparent proxy alongside them.

[ Agent ]
    │
    ▼
[ MicroLoop ] ──(verifies)──> [ Allow / Block ]
    │
    ▼
[ Tool ]

Every single tool invocation is inspected in real-time before execution is permitted.

Under the Hood: How MicroLoop Works

Each tool call passes through a strict, low-latency verification pipeline:

History Tracker: Detects repeated execution patterns (identical tool calls, repeated arguments, error loops, excessive retries). If a dangerous trajectory is detected, execution is blocked before the tool runs.
Rule Engine: Performs deep validation using JSON Schema, Regex rules, exact value matching, and per-tool execution policies.

This allows MicroLoop to enforce strict AI Agent Security and runtime policies without requiring you to rewrite your agent's core logic.
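The two pipeline stages can be sketched in a few lines of Python. This is an illustration of the idea only: MicroLoop itself is written in Rust, and `Verifier`, its parameters, and the regex-only rule format are assumptions made here (the real rule engine is described as also supporting JSON Schema and exact value matching).

```python
import hashlib
import json
import re
from collections import Counter
from typing import Dict, Optional, Tuple


class Verifier:
    """Allow or block a tool call before it runs (not MicroLoop's API)."""

    def __init__(self, max_identical: int = 3,
                 arg_rules: Optional[Dict[str, Dict[str, str]]] = None):
        self.max_identical = max_identical
        self.arg_rules = arg_rules or {}  # tool name -> {argument name: regex}
        self.history = Counter()          # call fingerprint -> times seen

    def check(self, tool: str, args: dict) -> Tuple[bool, str]:
        # Rule engine: every constrained argument must fully match its regex.
        for name, pattern in self.arg_rules.get(tool, {}).items():
            if not re.fullmatch(pattern, str(args.get(name, ""))):
                return False, f"argument {name!r} violates policy"
        # History tracker: fingerprint the call and count identical repeats.
        canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        self.history[key] += 1
        if self.history[key] > self.max_identical:
            return False, "identical call repeated too many times"
        return True, "ok"


verifier = Verifier(
    max_identical=2,
    arg_rules={"query_database": {"sql": r"(?i)select\b.*"}},
)
```

Because `check` returns before the tool is invoked, a blocked call costs only a hash and a few regex matches, never the side effects of the tool itself.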

Why Rust? Building High-Performance AI Infrastructure

Because verification happens synchronously before every tool call, latency is the enemy. If your safety layer adds 50 ms of overhead per tool call, your agent becomes unusable.

This is why MicroLoop is written entirely in Rust with a lightweight no_std core, making it suitable for highly performance-sensitive environments and edge deployments.

Current Benchmarks:

~17 μs average verification time
~375 ns adversarial loop rejection
~58,000 verifications per second

To ensure it plays nicely with the broader Python-heavy AI ecosystem, the project exposes a C ABI. This allows seamless integration from virtually any language, with native Python adapters already available for LangChain, LangGraph, CrewAI, and AutoGen.

# Example: Wrapping a LangChain tool with MicroLoop
from microloop import Guardrail
from langchain.tools import tool

guard = Guardrail(policy="strict_loop_detection")

@tool
@guard.verify
def query_database(sql: str) -> str:
    """Executes a SQL query. MicroLoop intercepts repetitive calls."""
    # `db` is assumed to be an existing database handle defined elsewhere.
    return db.execute(sql)

Beyond Loop Detection: The Future of AI Agent Security

Loop detection is only the first step in runtime safety. The same execution layer architecture is well positioned to support:

Prompt Injection Detection (analyzing tool outputs before they hit the context window)
Tool Permission Enforcement (RBAC for agents)
Dynamic Budget Limits (hard halts on token/compute spend)
Secret Protection (blocking PII or API keys from leaking into tool payloads)
Audit Logging & State Repair
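Two items from the list above, budget limits and secret protection, are simple enough to sketch. None of this is an existing MicroLoop feature: `Guard`, `BudgetExceeded`, and the two patterns are hypothetical, and a real deployment would need a vetted secret scanner instead of a short regex list.

```python
import re

# Illustrative patterns only.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # shape of an AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the configured budget."""


class Guard:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Hard halt once cumulative spend would exceed the budget."""
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(f"budget of {self.max_tokens} tokens exhausted")
        self.spent += tokens

    @staticmethod
    def leaks_secret(payload: str) -> bool:
        """True if the outgoing tool payload matches any secret pattern."""
        return any(pattern.search(payload) for pattern in SECRET_PATTERNS)
```

Both checks run on data the runtime already sees (token counts and tool payloads), so they need no cooperation from the model.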

As AI Agents transition from weekend demos to mission-critical production infrastructure, I believe runtime verification will become as fundamental as logging, authentication, and observability.

Final Thoughts

Prompt engineering tells an agent what it should do.
Runtime safety verifies what it is actually doing.

That's the gap I'm exploring with MicroLoop. The project is fully open source, and I'd love feedback from the community on the architecture, API design, and runtime approach.

👇 I'd love to hear from you: If you're building autonomous agents in production, how are you handling execution safety and infinite loops today? Let me know in the comments!

Devaretanmay / microloop

Microloop

A zero-dependency drop-in infinite loop detector for autonomous coding agents.

Microloop prevents autonomous AI agents from falling into infinite loops by intercepting redundant trajectories.

30-Second Quick Start

Microloop acts as a middleware. To use it as an upstream proxy in front of an LLM:

# 1. Start the proxy
cargo run --release --bin microloop-proxy

# 2. Point your agent to the proxy
export TARGET_API_URL="http://127.0.0.1:20128/v1"

Architecture

sequenceDiagram
    participant Agent as Autonomous Agent
    participant Microloop as Microloop Core
    participant LLM as LLM Provider
    
    Agent->>Microloop: Step 1: Tool Execution
    Microloop->>Microloop: Hash Trajectory State
    Microloop-->>Agent: Proceed (Unique state)
    Agent->>LLM: Generate next step
    
    Agent->>Microloop: Step 2: Identical Tool Execution
    Microloop->>Microloop: Hash Trajectory State
    Microloop-->>Agent: BLOCK (Loop Detected)
    Note over Agent: Agent is forced to pivot

Installation

Microloop is a C-compatible shared library with a no_std core.

Rust

Add this to your Cargo.toml:

[dependencies]
microloop = "

…

If you found this architectural breakdown helpful, consider leaving a ❤️ and following for more deep dives into AI infrastructure and Rust!

How to fix LangGraph GraphRecursionError without losing your checkpointed state

Tanmay Devare — Fri, 19 Jun 2026 17:47:59 +0000

We’ve all been there. You leave your LangGraph agent running overnight. It hits a 403 Forbidden on a scraping tool, or a REQUIRES_SINGLE_PART_NAMESPACE error on a SQL query.

Instead of failing gracefully, the agent asks the LLM for help. It gets stuck in a ReAct loop, burning through your API credits. Eventually, the native recursion_limit finally kills it.

But here is the worst part: the native recursion_limit is a blunt instrument.

When it hits the limit, LangGraph throws a GraphRecursionError. It crashes the run, wipes your checkpointed state, and returns a 500 error to your frontend user. You lose whatever partial data the agent did gather, and you get a surprise $4,000 API bill on Tuesday morning.

I spent the last month digging into why agents do this, especially with open-weight models (Qwen/Llama) that lack native self-correction. I realized that just throwing a raw RuntimeError or a "BLOCKED" string at an agent just confuses it, and it loops again.

Here is how we solved it using Pre-Model Intervention and Atomic Transcript Surgery.

The Architecture: Intercepting Before the Crash

Most guardrails wrap the entire graph or monkey-patch the HTTP client. This adds latency and breaks framework internals.

Instead, we use LangGraph’s native pre_model_hook and ToolNode APIs. This allows us to intercept the agent before the next LLM call, mutate the ephemeral prompt, and force a strategy pivot without corrupting the user's checkpointed state.

We call it the Progressive Intervention Protocol:

Nudge: Injects an ephemeral warning into the tool result.
Override: Safely strips the failing tool_calls from the prompt (preventing OpenAI/Anthropic 400 Bad Request validation errors) and forces a text-based strategy pivot.
Hard Stop: Halts the graph but preserves the checkpointed state so you get partial results instead of a crash.
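The escalation order of the three stages can be sketched as a function of how often the same failing pattern has recurred. The thresholds below (first intervention at `max_repeats`, then one stage per further repeat) are assumptions for illustration and may not match TokenCircuit's actual schedule. `Intervention` and `choose_intervention` are hypothetical names.

```python
from enum import Enum


class Intervention(Enum):
    NONE = "none"
    NUDGE = "nudge"
    OVERRIDE = "override"
    HARD_STOP = "hard_stop"


def choose_intervention(repeat_count: int, max_repeats: int = 3) -> Intervention:
    """Escalate as the same failing pattern keeps recurring."""
    if repeat_count < max_repeats:
        return Intervention.NONE       # still within tolerance
    if repeat_count == max_repeats:
        return Intervention.NUDGE      # warn inside the tool result
    if repeat_count == max_repeats + 1:
        return Intervention.OVERRIDE   # strip failing tool calls, force a pivot
    return Intervention.HARD_STOP      # halt, keeping checkpointed state


schedule = [choose_intervention(n).value for n in range(2, 7)]
```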

The 1-Line Fix

We open-sourced this engine as TokenCircuit. It uses zero-dependency semantic shingling (stdlib regex + hashlib) to catch paraphrased loops at <20µs latency.

Here is how you wrap your LangGraph agent:

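from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent
from tokencircuit.adapters.langgraph import tc_pre_model_hook, TokenCircuitToolNode
from tokencircuit import TokenCircuitConfig

# `model` (a chat model) and `tools` (a list of tools) are assumed to be defined elsewhere.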

# 1. Configure the intervention engine
config = TokenCircuitConfig(
    max_repeats=3,
    window_size=3,
    telemetry_enabled=True, # Logs interventions locally or to Supabase
    agency_id="my-org",
    client_id="my-app"
)

# 2. Wrap your tools with TokenCircuit's transaction tracking
safe_tool_node = TokenCircuitToolNode(tools)

# 3. Inject the pre-model hook for transcript surgery
agent = create_react_agent(
    model,
    tools=safe_tool_node,
    pre_model_hook=tc_pre_model_hook(config=config, node_name="agent"),
)

# Run your agent exactly as before
result = agent.invoke({"messages": [HumanMessage(content="Get me the stock price for AAPL")]})
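For readers curious about the shingling mentioned earlier, here is a minimal sketch of the general technique using only `re` and `hashlib`. It is not TokenCircuit's code: the function names and the shingle size `k=3` are assumptions, and plain word shingles catch near-duplicate messages, not arbitrary paraphrases.

```python
import hashlib
import re
from typing import Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Hash every run of k consecutive lowercase word tokens."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        windows = [words]  # short message: treat it as a single shingle
    else:
        windows = [words[i:i + k] for i in range(len(words) - k + 1)]
    return {hashlib.sha256(" ".join(w).encode("utf-8")).hexdigest() for w in windows}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity between the shingle sets of two messages."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


near_duplicate = similarity(
    "Error 403 Forbidden when fetching page",
    "error 403 forbidden when fetching page!",
)
unrelated = similarity("retry the scraper", "compute the stock price")
```

A loop detector built on this would compare each new message against a sliding window of recent ones and count a repeat whenever the similarity crosses a chosen threshold.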

Why This Matters for Production

When you deploy autonomous agents for clients, you can't afford silent loop failures.

With TokenCircuit V8.1, we achieved zero core dependencies. We swapped pydantic for @dataclass(slots=True) and tiktoken for stdlib shingling. This means:

Zero supply-chain vulnerabilities.
<20µs overhead per turn.
100% local execution. No prompts or PII ever leave your RAM.

We also built a local CLI report generator. When an intervention happens, it logs to a local NDJSON file. You can run tokencircuit report --file events.json to generate a board-ready table showing exactly how many tokens and dollars your guardrail saved.

The Code is Open Source

If you are tired of watching your agents burn money on infinite loops, check out the repo.

GitHub: https://github.com/Devaretanmay/TokenCircut
PyPI: pip install "tokencircuit[langgraph]"

Question for the builders: What’s the most money an agent has burned for you in a single night? Drop your war stories in the comments. 👇