<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anil Murty</title>
    <description>The latest articles on DEV Community by Anil Murty (@anilmurty).</description>
    <link>https://dev.to/anilmurty</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3925538%2F7429c888-cc92-4f5f-a5da-c22be194079c.png</url>
      <title>DEV Community: Anil Murty</title>
      <link>https://dev.to/anilmurty</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anilmurty"/>
    <language>en</language>
    <item>
      <title>What is OpenTelemetry, and why does it matter for AI agents?</title>
      <dc:creator>Anil Murty</dc:creator>
      <pubDate>Mon, 11 May 2026 19:59:10 +0000</pubDate>
      <link>https://dev.to/anilmurty/what-is-opentelemetry-and-why-does-it-matter-for-ai-agents-3ddo</link>
      <guid>https://dev.to/anilmurty/what-is-opentelemetry-and-why-does-it-matter-for-ai-agents-3ddo</guid>
      <description>&lt;p&gt;&lt;em&gt;This post originally appeared on &lt;a href="https://www.tokenjam.dev/blog/2026-05-10-opentelemetry-for-ai-agents?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=cross-post" rel="noopener noreferrer"&gt;tokenjam.dev/blog&lt;/a&gt;. It's part of a 14-post series on the agentic AI ecosystem.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenTelemetry is the CNCF standard for vendor-neutral observability instrumentation. Write once, export anywhere.&lt;/li&gt;
&lt;li&gt;Three core components: SDKs (in your code), OTLP (the wire protocol), and collectors/backends (where data lives).&lt;/li&gt;
&lt;li&gt;The GenAI semantic conventions define a shared schema for LLM traces: &lt;code&gt;gen_ai.request.model&lt;/code&gt;, &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;, and others. They're actively evolving but already widely adopted.&lt;/li&gt;
&lt;li&gt;Claude Code natively emits OTLP traces with &lt;code&gt;CLAUDE_CODE_ENABLE_TELEMETRY=1&lt;/code&gt;; agent frameworks like LangChain, LlamaIndex, and others follow the same pattern.&lt;/li&gt;
&lt;li&gt;Lock-in is the real problem OTel solves. Instrument once, and any OTel-aware backend can consume your traces without re-architecting.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What is OpenTelemetry?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenTelemetry&lt;/strong&gt; is the Cloud Native Computing Foundation's standard for collecting and exporting observability signals (traces, metrics, and logs) from applications. Instead of locking you into a single vendor's telemetry format, OpenTelemetry defines &lt;em&gt;how&lt;/em&gt; applications emit telemetry data in a vendor-neutral way, using the &lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OpenTelemetry Protocol (OTLP)&lt;/a&gt;. You instrument your code once and can send your telemetry to any compatible backend: Datadog, New Relic, Grafana, Jaeger, or any other system that speaks OTLP.&lt;/p&gt;
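
&lt;p&gt;The "instrument once" promise is concrete. Here's a minimal sketch in Python, assuming the &lt;code&gt;opentelemetry-sdk&lt;/code&gt; and OTLP exporter packages are installed; the endpoint is a placeholder, and swapping backends means changing only that value, never the instrumentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One-time setup. The endpoint is the only backend-specific piece.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://your-backend.example.com:4317"))
)
trace.set_tracer_provider(provider)

# Application code emits spans the same way no matter which backend receives them.
tracer = trace.get_tracer("my-agent")
with tracer.start_as_current_span("handle-request"):
    pass  # agent logic goes here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;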

&lt;h2&gt;
  
  
  The three components: SDKs, OTLP, and backends
&lt;/h2&gt;

&lt;p&gt;Three moving parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. SDKs (instrumentation in your code)
&lt;/h3&gt;

&lt;p&gt;An OpenTelemetry SDK is a library that runs in your application. It collects traces, metrics, and logs from your code and hands them off for export. You install it, configure it, and call its APIs (or rely on auto-instrumentation) to emit telemetry. For Python agents, the &lt;a href="https://github.com/open-telemetry/opentelemetry-python" rel="noopener noreferrer"&gt;OpenTelemetry Python SDK&lt;/a&gt; is the foundation. For TypeScript, &lt;a href="https://github.com/open-telemetry/opentelemetry-js" rel="noopener noreferrer"&gt;OpenTelemetry JavaScript&lt;/a&gt; serves the same purpose.&lt;/p&gt;

&lt;p&gt;SDKs do the heavy lifting. They manage span lifecycle, batch telemetry, apply sampling policies, and handle backpressure when backends are slow. Different instrumentation libraries (for LangChain, Anthropic, Ollama, and others) sit &lt;em&gt;on top&lt;/em&gt; of an SDK and emit standardized spans into it.&lt;/p&gt;
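
&lt;p&gt;To make "on top of an SDK" concrete, here's a hedged sketch of what such an instrumentation layer does; the client and function names are hypothetical, not from any real package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace

tracer = trace.get_tracer("example-llm-instrumentation")

def traced_llm_call(client, request):
    # Wrap the provider call in a span; the SDK underneath handles
    # batching, sampling, and export.
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", request["model"])
        response = client.create(request)
        span.set_attribute("gen_ai.usage.output_tokens", response["output_tokens"])
        return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;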

&lt;h3&gt;
  
  
  2. OTLP: the wire protocol
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP (OpenTelemetry Protocol)&lt;/a&gt; is how telemetry gets from your SDK to a backend. OTLP runs over gRPC or HTTP/1.1, uses Protocol Buffers for encoding, and specifies backpressure handling and retry semantics. You don't think about OTLP directly. It's configured via environment variables like &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; and &lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt;. It's the contract between your SDK and any backend that claims OpenTelemetry support.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Collectors and backends
&lt;/h3&gt;

&lt;p&gt;An OpenTelemetry &lt;strong&gt;collector&lt;/strong&gt; is a standalone service that receives telemetry data via OTLP, applies transformations, handles batching at scale, and routes the data to backends. A &lt;strong&gt;backend&lt;/strong&gt; (Datadog, Grafana Loki, Jaeger, Honeycomb, and others) stores and queries your traces. You can skip the collector for small workloads. Many apps export directly to a cloud backend via OTLP. Collectors give you flexibility: they let you filter and enrich telemetry before it hits your backend, and they buffer data when backends are slow.&lt;/p&gt;
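
&lt;p&gt;A minimal collector configuration shows the receive-process-route shape; the backend endpoint below is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlphttp:
    endpoint: https://backend-a.example.com:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;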

&lt;h2&gt;
  
  
  Why agents need OpenTelemetry specifically
&lt;/h2&gt;

&lt;p&gt;Vendor lock-in is real for observability. If you instrument your agent to emit telemetry in Datadog's proprietary format, switching to New Relic means rewriting instrumentation across your codebase. For organizations with many agents and teams, this tax is enormous.&lt;/p&gt;

&lt;p&gt;OpenTelemetry fixes this by making the instrumentation the &lt;em&gt;constant&lt;/em&gt;, not the vendor. Your agent code emits OTLP. Your backend is the variable. You can migrate backends, or use multiple backends simultaneously, without touching your instrumentation layer.&lt;/p&gt;

&lt;p&gt;This matters more for agent teams because agent complexity is growing. A modern agent traces LLM calls, tool invocations, retrieval steps, and agent reasoning across multiple frameworks and runtimes. A shared observability standard means you're not training teams to emit telemetry differently for each agent tool; they all follow the same conventions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The GenAI semantic conventions
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry includes a specification for semantic conventions: standardized attribute names and meanings that make spans interoperable across backends. For generative AI, the &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry GenAI semantic conventions&lt;/a&gt; define how to structure traces from LLM calls and agent steps.&lt;/p&gt;

&lt;p&gt;Key attributes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.system&lt;/code&gt;&lt;/strong&gt;: The GenAI system or LLM provider (e.g., &lt;code&gt;openai&lt;/code&gt;, &lt;code&gt;anthropic&lt;/code&gt;, &lt;code&gt;ollama&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.request.model&lt;/code&gt;&lt;/strong&gt;: The name of the model being invoked (e.g., &lt;code&gt;claude-3-5-sonnet&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.operation.name&lt;/code&gt;&lt;/strong&gt;: The operation type (e.g., &lt;code&gt;chat&lt;/code&gt;, &lt;code&gt;completion&lt;/code&gt;, &lt;code&gt;embedding&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;gen_ai.usage.output_tokens&lt;/code&gt;&lt;/strong&gt;: Token counts from the LLM response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.response.id&lt;/code&gt;&lt;/strong&gt;: The response ID from the model provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.agent.id&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;gen_ai.agent.name&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;gen_ai.agent.version&lt;/code&gt;&lt;/strong&gt;: Identity and version of the agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gen_ai.conversation.id&lt;/code&gt;&lt;/strong&gt;: Unique identifier for a conversation thread (for multi-turn traces).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These conventions are &lt;em&gt;actively evolving&lt;/em&gt;; the specification is not frozen. That's by design. As new use cases emerge (tool use, function calling, retrieval-augmented generation, multi-agent coordination), the spec grows. Tools that adopt the conventions now benefit immediately. They gain interoperability across backends even as the spec matures.&lt;/p&gt;

&lt;p&gt;Adopting these conventions in your agent instrumentation means any OpenTelemetry-aware backend can parse and query your traces without custom parsing logic. You get consistent dashboards and analytics across vendors.&lt;/p&gt;
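
&lt;p&gt;Attaching these attributes with the OpenTelemetry Python API looks like this; the helper function and the token numbers are illustrative, not part of any official SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace

def annotate_llm_span(span, usage):
    # Attach GenAI semantic-convention attributes to an in-flight span.
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "claude-3-5-sonnet")
    span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
    span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])

tracer = trace.get_tracer("agent-demo")
with tracer.start_as_current_span("chat claude-3-5-sonnet") as span:
    annotate_llm_span(span, {"input_tokens": 1200, "output_tokens": 350})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;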

&lt;h2&gt;
  
  
  How real agent runtimes emit OTel today
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry adoption in the agent ecosystem is accelerating. Concrete examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Code
&lt;/h3&gt;

&lt;p&gt;Claude Code natively emits OpenTelemetry traces when you set the &lt;code&gt;CLAUDE_CODE_ENABLE_TELEMETRY=1&lt;/code&gt; environment variable. You then configure where traces go using standard OTEL environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_ENABLE_TELEMETRY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://your-backend.example.com:4317
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;grpc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For full configuration details, see the &lt;a href="https://code.claude.com/docs/en/env-vars" rel="noopener noreferrer"&gt;Claude Code environment variables documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  LangChain
&lt;/h3&gt;

&lt;p&gt;LangChain supports OpenTelemetry instrumentation via the &lt;a href="https://pypi.org/project/opentelemetry-instrumentation-langchain/" rel="noopener noreferrer"&gt;opentelemetry-instrumentation-langchain&lt;/a&gt; package. You instrument your LangChain app and export via any OTLP-compatible backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;opentelemetry.instrumentation.langchain&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LangchainInstrumentor&lt;/span&gt;

&lt;span class="nc"&gt;LangchainInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traces follow the GenAI conventions, so your LangChain chains are observable across any OpenTelemetry-aware platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  LlamaIndex + OpenInference
&lt;/h3&gt;

&lt;p&gt;LlamaIndex integrates with &lt;a href="https://arize-ai.github.io/openinference/" rel="noopener noreferrer"&gt;OpenInference&lt;/a&gt;, a set of conventions built &lt;em&gt;on top of&lt;/em&gt; OpenTelemetry for AI observability. OpenInference spans are valid OTLP traces, so you get the same portability as native OpenTelemetry.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenLLMetry by Traceloop
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.traceloop.com/docs/openllmetry/introduction" rel="noopener noreferrer"&gt;OpenLLMetry&lt;/a&gt; is a collection of OpenTelemetry instrumentations for LLM apps. It provides ready-made instrumentation for LangChain, Anthropic, Ollama, Pinecone, Qdrant, and many other LLM-adjacent services. Because it's built on OpenTelemetry, any instrumentation you install works with any OTLP backend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notable tools and SDKs
&lt;/h2&gt;

&lt;p&gt;The OpenTelemetry ecosystem for agents includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-python" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry Python SDK&lt;/strong&gt;&lt;/a&gt;: The core SDK for Python agents. Use this as your foundation for any Python-based agent instrumentation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/open-telemetry/opentelemetry-js" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenTelemetry JavaScript SDK&lt;/strong&gt;&lt;/a&gt;: The equivalent for Node.js and browser-based agents.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/traceloop/openllmetry" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenLLMetry&lt;/strong&gt;&lt;/a&gt;: Pre-built instrumentations for LangChain, Anthropic, OpenAI, LlamaIndex, Ollama, Qdrant, and others. Reduces boilerplate if your agent uses popular frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arize-ai.github.io/openinference/" rel="noopener noreferrer"&gt;&lt;strong&gt;OpenInference&lt;/strong&gt;&lt;/a&gt; by Arize: A semantic convention and instrumentation set for AI workloads. Integrates with OpenTelemetry and works with any OTel backend, including Arize Phoenix, Jaeger, and Datadog.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Arize-ai/phoenix" rel="noopener noreferrer"&gt;&lt;strong&gt;Phoenix by Arize&lt;/strong&gt;&lt;/a&gt;: An open-source observability tool for ML and LLM apps that consumes OpenInference (and thus OpenTelemetry) traces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collector distributions&lt;/strong&gt;: &lt;a href="https://github.com/open-telemetry/opentelemetry-collector" rel="noopener noreferrer"&gt;OpenTelemetry Collector&lt;/a&gt; is the standard. Vendor-specific distributions (e.g., Datadog Agent, New Relic Agent) also speak OTLP.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why are my LLM calls showing up as HTTP spans instead of GenAI spans?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You probably have base HTTP instrumentation without an LLM-aware layer on top. The default OpenTelemetry HTTP instrumentation captures your LLM API calls as plain HTTP spans (&lt;code&gt;POST /v1/messages&lt;/code&gt;, &lt;code&gt;200 OK&lt;/code&gt;, 142ms). They show up. They're just missing the actually useful attributes: model name, token counts, response ID. To get GenAI semantic-convention spans, install an LLM-aware instrumentor: OpenLLMetry's Anthropic or OpenAI instrumentor, OpenInference, or use a framework that emits GenAI spans natively (Claude Code, LangChain via its OTel package). Install the instrumentor (e.g., &lt;code&gt;opentelemetry-instrumentation-anthropic&lt;/code&gt; from OpenLLMetry) and initialize it before your code creates the LLM client. After that, calls to &lt;code&gt;client.messages.create()&lt;/code&gt; should produce &lt;code&gt;gen_ai.*&lt;/code&gt; spans alongside the HTTP spans, and you can filter on &lt;code&gt;gen_ai.system&lt;/code&gt; in your backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which OpenTelemetry SDK should I use with my agent framework?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depends on your language and framework. Python agents use the &lt;a href="https://github.com/open-telemetry/opentelemetry-python" rel="noopener noreferrer"&gt;OpenTelemetry Python SDK&lt;/a&gt;. If you're on LangChain, LlamaIndex, or another framework, look for that framework's OTel instrumentation package first (via OpenLLMetry or framework-native support). If no instrumentation exists, you can hand-instrument your code using the SDK directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's OTLP?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OTLP is the OpenTelemetry Protocol: the wire format and transport mechanism for sending telemetry data from your SDK to a collector or backend. It's built on Protocol Buffers and runs over gRPC or HTTP/1.1. You don't configure OTLP directly. You set environment variables like &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt; to point your SDK at a backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does setting &lt;code&gt;CLAUDE_CODE_ENABLE_TELEMETRY=1&lt;/code&gt; send my data to Anthropic?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No. The flag tells Claude Code to emit OpenTelemetry traces to whatever OTLP endpoint &lt;em&gt;you&lt;/em&gt; configure via &lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;. If you don't set an endpoint, the SDK has nowhere to send them and they're dropped on the floor. Anthropic doesn't receive your traces from this path. That's distinct from Anthropic's usage-and-billing telemetry, which is sent to Anthropic regardless of the OTel flag because it's how the API gets metered. The OTel data is for &lt;em&gt;you&lt;/em&gt;: send it to Datadog, Grafana, a local Jaeger, or wherever you run observability. See the &lt;a href="https://code.claude.com/docs/en/env-vars" rel="noopener noreferrer"&gt;Claude Code env-vars docs&lt;/a&gt; for the full configuration list.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I set up telemetry export in my agent?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the OpenTelemetry SDK for your language.&lt;/li&gt;
&lt;li&gt;Install instrumentation packages for your frameworks (LangChain, Anthropic, and so on).&lt;/li&gt;
&lt;li&gt;Initialize the instrumentation in your agent startup code.&lt;/li&gt;
&lt;li&gt;Set OTEL environment variables to point at your backend:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OTEL_EXPORTER_OTLP_ENDPOINT&lt;/code&gt;: Your backend's OTLP endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OTEL_EXPORTER_OTLP_PROTOCOL&lt;/code&gt;: &lt;code&gt;grpc&lt;/code&gt; or &lt;code&gt;http/protobuf&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;OTEL_EXPORTER_OTLP_HEADERS&lt;/code&gt;: Auth headers, if needed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;See your backend's documentation for the specific OTLP endpoint URL.&lt;/p&gt;
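
&lt;p&gt;Put together with the LangChain instrumentor from earlier, startup code might look like this sketch; initialize it before constructing any chains or clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

# Steps 1-3: SDK, instrumentation package, initialization at startup.
# The exporter reads the OTEL_EXPORTER_OTLP_* environment variables (step 4).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
LangchainInstrumentor().instrument()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;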

&lt;p&gt;&lt;strong&gt;Can I use OpenTelemetry with multiple backends simultaneously?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Configure multiple exporters in your SDK, or use an OpenTelemetry Collector to fan telemetry out to multiple destinations. Common during a backend migration, or when you want redundancy.&lt;/p&gt;
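
&lt;p&gt;With a collector, fan-out is just naming two exporters in one pipeline; this fragment's endpoints are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;exporters:
  otlphttp/primary:
    endpoint: https://backend-a.example.com:4318
  otlphttp/secondary:
    endpoint: https://backend-b.example.com:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/primary, otlphttp/secondary]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;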

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry GenAI semantic conventions specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/otlp/" rel="noopener noreferrer"&gt;OTLP Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://code.claude.com/docs/en/env-vars" rel="noopener noreferrer"&gt;Claude Code environment variables and telemetry setup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.traceloop.com/docs/openllmetry/introduction" rel="noopener noreferrer"&gt;OpenLLMetry documentation and instrumentations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arize-ai.github.io/openinference/" rel="noopener noreferrer"&gt;OpenInference specification for AI observability&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;See also: &lt;a href="https://tokenjam.dev/blog/what-is-agent-observability" rel="noopener noreferrer"&gt;What is agent observability?&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.tokenjam.dev/blog/2026-05-10-opentelemetry-for-ai-agents?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=cross-post" rel="noopener noreferrer"&gt;tokenjam.dev/blog&lt;/a&gt;. Part of an ongoing series on the agentic AI ecosystem.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>What is Agent Observability?</title>
      <dc:creator>Anil Murty</dc:creator>
      <pubDate>Mon, 11 May 2026 19:52:34 +0000</pubDate>
      <link>https://dev.to/anilmurty/what-is-agent-observability-2kpl</link>
      <guid>https://dev.to/anilmurty/what-is-agent-observability-2kpl</guid>
      <description>&lt;p&gt;&lt;em&gt;This post originally appeared on &lt;a href="https://www.tokenjam.dev/blog/2026-05-09-agent-observability?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=cross-post" rel="noopener noreferrer"&gt;tokenjam.dev/blog&lt;/a&gt;. It's part of a 14-post series on the agentic AI ecosystem.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent observability captures what an agent did (tool calls, token costs, latency, reasoning chains) at a level of detail sufficient to debug and audit behavior in production.&lt;/li&gt;
&lt;li&gt;Traditional logs and metrics aren't enough; you need traces that record the LLM's step-by-step decisions, tool invocations, and outcomes.&lt;/li&gt;
&lt;li&gt;Agents are harder to observe than services because of nondeterminism, deeply nested calls, prompts and completions as data, and vocabulary that didn't exist three years ago.&lt;/li&gt;
&lt;li&gt;The OpenTelemetry GenAI semantic conventions are the emerging standard for agent telemetry.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Agent observability is the practice of capturing what an AI agent did (its tool calls, token costs, behavioral patterns, and outcomes) at a level of detail sufficient to debug, optimize, and audit agent behavior in production. You record the agent's full journey: every decision point, every tool invocation, every LLM call with inputs and outputs, latencies, costs, and errors. Service observability captures &lt;em&gt;what your code did&lt;/em&gt;. Agent observability captures the reasoning chain itself: the sequence of thoughts and decisions that led the agent to act.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why agent observability is harder than service observability
&lt;/h2&gt;

&lt;p&gt;Service observability is built on a predictable model. A request comes in, your code executes a series of steps, a response goes out. Each step is deterministic. Logs tell you what happened. Metrics tell you how long it took and whether it succeeded.&lt;/p&gt;

&lt;p&gt;Agents break this model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nondeterminism is the core problem.&lt;/strong&gt; The same input to an agent with the same model and parameters might produce different outputs on different runs. The LLM samples from a probability distribution. You can't debug an agent from logs alone. You have to capture the complete trace of that specific run to understand what reasoning led to that specific output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool calls are deeply nested.&lt;/strong&gt; A service call stack might be five or ten levels deep. An agentic system can have an agent call a tool, which triggers a retrieval operation, which calls an embedding model, which calls a database, which triggers another tool. The nesting is deep and irregular. A trace that doesn't capture every step in this chain will miss the real bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompts and completions are your actual data.&lt;/strong&gt; In a service, your data is SQL queries and JSON payloads. In an agent, your data is the prompt sent to the LLM and the completion it returned. These are large and unstructured. They're often sensitive: they contain user context, proprietary information, internal state. Traditional logging systems don't handle this well. Observability for agents has to be built around capturing and safely storing these artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The vocabulary didn't exist three years ago.&lt;/strong&gt; Terms like "token usage," "tool selection," "context window," and "hallucination" are specific to the agentic context. Existing APM (application performance monitoring) tools (Datadog, New Relic, Dynatrace) were built for microservices. They have no native concept of an LLM call, a token count, or a tool invocation. Shoehorning agent data into these systems works. It's also awkward.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three pillars, adapted for agents
&lt;/h2&gt;

&lt;p&gt;Observability has three pillars: traces, metrics, logs. The definitions shift when you apply them to agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traces&lt;/strong&gt; capture the complete execution path of a request. In a microservice, a trace is a sequence of function calls and RPC hops. In an agent, a trace is the agent's full journey: the user input, each LLM call (with prompt and completion), each tool invocation and result, latency at each step, token usage at each step, and the final output. A trace is the highest-fidelity record you have. It answers questions like "Why did the agent choose tool X instead of tool Y?" or "Where did the latency spike occur?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt; are aggregations: counts and percentiles. In services, you track request latency, error rate, throughput. For agents, you track cost per request (sum of token usage × model pricing), latency per LLM call, tool invocation frequency, error rates (both LLM errors and tool errors), and token efficiency (useful output tokens vs. wasted context). Metrics let you spot trends over time and set up alerts when something goes wrong at scale.&lt;/p&gt;
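
&lt;p&gt;The cost metric is plain arithmetic once traces carry token counts. A sketch with made-up per-million-token prices (check your provider's current pricing):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical price table: dollars per million tokens (input, output).
PRICES = {"claude-3-5-sonnet": (3.00, 15.00)}

def request_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A request with 12k input tokens and 800 output tokens:
cost = request_cost("claude-3-5-sonnet", 12_000, 800)  # 0.048 dollars
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;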

&lt;p&gt;&lt;strong&gt;Logs&lt;/strong&gt; are raw events: "This LLM call failed," "Token limit exceeded," "Tool returned an error." In a service, logs focus on errors. In an agent, logs are also informational: "Agent selected tool X." "Retry attempt 2 of 3." Logs are lower resolution than traces. They're faster to query and more storage-efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually capture
&lt;/h2&gt;

&lt;p&gt;A production-grade agent observability system captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM calls&lt;/strong&gt;: Model name, parameters (temperature, max_tokens, top_p), the prompt sent, the completion received, token counts (input and output), latency, cost, success or failure. This is the core of agent observation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool invocations&lt;/strong&gt;: Tool name, input parameters, output, latency, whether the tool succeeded or failed, and any retry information. Tools are where your agent touches the outside world. They cause most of your latency and most of your errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token usage per call&lt;/strong&gt;: Not just total tokens consumed. A breakdown: how many tokens in the context window, how many in the prompt, how many in the response. This helps you optimize context and identify tokens wasted on irrelevant context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent's reasoning chain&lt;/strong&gt;: The intermediate thoughts or justifications the agent produced at each step. Some LLM frameworks (like ReAct) explicitly generate these; others encode them implicitly. Capturing this chain is what lets you debug why an agent made a particular decision.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model and parameters&lt;/strong&gt;: Which model was used, which version, what temperature and sampling parameters. This matters because the same agent with different parameters can behave very differently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Errors and retries&lt;/strong&gt;: When a tool call failed, did the agent retry? How many times? Did it eventually succeed or give up? This tells you if your agent is robust or brittle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency per layer&lt;/strong&gt;: Total latency is a sum of LLM latency + tool latency + overhead. Breaking this down tells you where to optimize.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These signals should conform to the &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry semantic conventions for generative AI&lt;/a&gt;. The conventions define a standard schema for representing LLM calls, tool use, embeddings, and agent systems in trace data. Adopting the standard means your agent traces can be ingested by any OpenTelemetry-compatible backend (Jaeger, Datadog, Elastic, or a custom system) without vendor lock-in. See &lt;a href="https://tokenjam.dev/blog/what-is-opentelemetry-for-ai-agents" rel="noopener noreferrer"&gt;What is OpenTelemetry for AI agents?&lt;/a&gt; for a deeper dive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why does my trace show 47 LLM calls when I only invoked the agent once?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three common causes. First, the framework you're using (LangChain, LlamaIndex, AutoGen, CrewAI) might be running nested chains in which each "step" is itself an LLM call: a planning call, an action call, a reflection call, a synthesis call. A single user request fans out fast. Second, retries: if a tool call returns an unexpected error or the LLM produces malformed output, many frameworks silently retry with backoff, multiplying calls. Third, agent loops: if the agent can't converge on an answer, it keeps reasoning and acting until it hits a max-iteration limit. Open the trace tree and look at timestamps. Tightly clustered calls with the same model and parameters mean retries. Spread-out calls with different prompts mean the framework is decomposing the task more than you expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My agent traces are 50MB each. Should I be worried?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, in a specific way. Trace size is dominated by prompt and completion text. A 50MB trace means you're sending massive prompts to the LLM: huge system prompts, retrieved documents, long conversation history, included file contents. The cost is real: that's a lot of input tokens per call. The performance hit is also real because most trace UIs struggle to render or query traces above ~10MB. Two fixes work. First, reduce what you put in the prompt: tighter system prompts, smarter retrieval, summarize conversation history rather than passing it raw. Second, configure your observability tool to truncate long fields above a threshold (Langfuse, Arize Phoenix, and Datadog all support this). Truncated traces are still useful for navigation, and you can fetch the full prompt from your application logs if you actually need it.&lt;/p&gt;
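
&lt;p&gt;Field truncation is a few lines if your tool doesn't do it for you. A sketch; the 4&amp;nbsp;KB threshold is arbitrary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_FIELD_CHARS = 4096  # arbitrary; tune to what your trace UI handles

def truncate_field(text, limit=MAX_FIELD_CHARS):
    # Keep the head of oversized prompt/completion fields and note the loss.
    if len(text) &amp;gt; limit:
        dropped = len(text) - limit
        return text[:limit] + f"... [truncated {dropped} chars]"
    return text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;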

&lt;p&gt;&lt;strong&gt;Can I use my existing APM (Datadog, New Relic) for agents?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Partially. Datadog and New Relic have built LLM modules onto their existing platforms. They work. They weren't designed for agents from the ground up. They're better at capturing that an LLM call happened than at capturing the reasoning chain or the interaction between multiple tool calls. If you're already in Datadog, LLM Observability is a reasonable choice. If you're starting fresh, a tool built for agents will give you more signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What should I capture in production agent traces?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with: every LLM call (prompt and completion), every tool invocation (name and result), latency per call, total token usage, and final outcome (success or failure). Add error details if the agent failed. Once that's stable, add cost breakdown per model and tool selection reasoning. Don't try to capture everything on day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I avoid storing sensitive data in traces?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most tools support redaction: marking which fields should not be logged (API keys, user PII, secrets). Some (like Datadog LLM Observability) ship with automatic PII detection. Build redaction into your SDK wrapper early; it's easier to add than to retrofit. Also consider sampling. You don't need to trace every request, just a statistically significant sample.&lt;/p&gt;
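&lt;p&gt;A redaction layer can start as a few regexes applied before any string enters a span. The patterns below are illustrative placeholders; real PII detection needs far broader coverage than this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Illustrative patterns only; extend for your own key and PII formats.
REDACTION_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{16,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
]

def redact(text):
    """Mask known secret and PII patterns before logging or tracing."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
&lt;/code&gt;&lt;/pre&gt;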

&lt;p&gt;&lt;strong&gt;How much overhead does observability add?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good observability SDKs are asynchronous. Traces are queued locally and sent in batches in the background, so they add minimal latency to your agent's response time. Expect overhead of 5–15% at the p99, depending on the tool and your stack. That's a worthwhile trade-off for production visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry semantic conventions for generative AI&lt;/a&gt;. The emerging standard for agent telemetry. Start with the GenAI spans spec.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tokenjam.dev/blog/what-is-an-ai-agent" rel="noopener noreferrer"&gt;What is an AI agent?&lt;/a&gt;. Background on agent architecture and how agents differ from prompt-based systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://tokenjam.dev/blog/what-is-opentelemetry-for-ai-agents" rel="noopener noreferrer"&gt;What is OpenTelemetry for AI agents?&lt;/a&gt;. Deep dive into OpenTelemetry's semantic conventions and how to instrument agents with OTel.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.tokenjam.dev/blog/2026-05-09-agent-observability?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=cross-post" rel="noopener noreferrer"&gt;tokenjam.dev/blog&lt;/a&gt;. Part of an ongoing series on the agentic AI ecosystem.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Agents 101: Reasoning, Actions &amp; Autonomy</title>
      <dc:creator>Anil Murty</dc:creator>
      <pubDate>Mon, 11 May 2026 19:30:52 +0000</pubDate>
      <link>https://dev.to/anilmurty/agents-101-reasoning-actions-autonomy-3imm</link>
      <guid>https://dev.to/anilmurty/agents-101-reasoning-actions-autonomy-3imm</guid>
      <description>&lt;p&gt;&lt;em&gt;This post originally appeared on &lt;a href="https://www.tokenjam.dev/blog/2026-05-08-agents-101?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=cross-post" rel="noopener noreferrer"&gt;tokenjam.dev/blog&lt;/a&gt;. It's part of a 14-post series on the agentic AI basics and ecosystem.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AI agent uses an LLM to reason about a goal and decide what actions to take, calling tools and observing results until the goal is reached&lt;/li&gt;
&lt;li&gt;Agents differ fundamentally from chatbots (which don't act) and workflows (which don't decide)&lt;/li&gt;
&lt;li&gt;The ReAct pattern (reasoning + acting) is the dominant architecture in modern agent systems&lt;/li&gt;
&lt;li&gt;Agents range from copilots that suggest actions to fully autonomous systems that run unattended for hours&lt;/li&gt;
&lt;li&gt;Key components: the LLM (reasoning), tools (actions), context/memory (state), and a control loop (orchestration)&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What is an AI agent?&lt;/strong&gt; An AI agent is a system that uses a large language model to make decisions and take actions in pursuit of a goal. It calls tools, observes what they return, and iterates until the goal is reached. A chatbot waits for the next message; an agent plans and executes its own sequence of steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it matters
&lt;/h2&gt;

&lt;p&gt;The term entered the mainstream in 2023, when projects like AutoGPT showed that LLMs could direct their own execution. The concept wasn't new. Researchers had been studying goal-directed autonomous systems for decades. What changed was accessibility: capable base models (GPT-4, Claude) and standardized tool-calling APIs made it practical to build a working agent in a few dozen lines of code.&lt;/p&gt;

&lt;p&gt;The word now gets used loosely. Some vendors call a chatbot with a search feature an agent. Others claim that any LLM inference with retrieval is "agentic." This inflation matters. It obscures what's actually new and what's repackaging. Precision helps you know what you're building or evaluating.&lt;/p&gt;

&lt;p&gt;Agents represent a shift in how LLMs are deployed. The old model: user asks a question, system returns an answer, conversation ends. Agents invert that. The system receives a goal, decides on sub-goals, gathers information, corrects itself, and iterates without waiting for permission between steps. New architecture. New error handling. New thinking about safety and observability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agents vs. chatbots vs. workflows vs. traditional AI
&lt;/h2&gt;

&lt;p&gt;A quick way to distinguish these four categories is to ask: does it use an LLM to decide what to do next? And can it call tools to act on those decisions?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chatbots&lt;/strong&gt; use an LLM to generate text. They don't call tools, and they don't pursue goals across steps. A customer-service chatbot answers your question. It doesn't modify your account or call internal APIs unless you ask. Even then, it tends to suggest options or retrieve data rather than decide and act. The LLM's job is to understand and respond.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflows&lt;/strong&gt; call tools and pursue goals. They don't use an LLM to decide which tool to call or how to interpret the result. A workflow might be: fetch customer data, run a validation rule, log an event, send an email. Each step is predefined. Branching is rule-based. The LLM is not in the loop. Workflows are predictable and cheap. They break when the task is ambiguous or open-ended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents&lt;/strong&gt; combine both. The LLM observes the current state and decides which tool to call next. It adapts and self-corrects as it goes. If a tool call fails, the agent reasons about why and tries something else. The flexibility costs you something. Agents are less predictable, more expensive per inference, and harder to debug. The reward is open-ended tasks, where the path isn't predetermined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional AI/ML systems&lt;/strong&gt; (classifiers, regressions, recommenders) optimize a fixed function learned from data. They have no LLM, and they don't pursue multi-step goals. They are specialized and efficient. Generalizing to a new task means retraining.&lt;/p&gt;

&lt;p&gt;The table below summarizes the differences:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Chatbot&lt;/th&gt;
&lt;th&gt;Workflow&lt;/th&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Traditional ML&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Uses LLM to decide next step?&lt;/td&gt;
&lt;td&gt;No (generates text)&lt;/td&gt;
&lt;td&gt;No (follows rules)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calls tools?&lt;/td&gt;
&lt;td&gt;Rarely; usually retrieval only&lt;/td&gt;
&lt;td&gt;Yes; predefined sequence&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes; chosen by LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pursues multi-step goal?&lt;/td&gt;
&lt;td&gt;No (responds to input)&lt;/td&gt;
&lt;td&gt;Yes; fixed path&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes; adaptive path&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles ambiguous tasks?&lt;/td&gt;
&lt;td&gt;Moderate (can discuss)&lt;/td&gt;
&lt;td&gt;Poor (requires rigid structure)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Good (can reason and adapt)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The ReAct pattern and core components
&lt;/h2&gt;

&lt;p&gt;Most agents built since 2023 follow a pattern called ReAct (Reasoning and Acting), introduced in &lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;Yao et al.'s 2022 paper&lt;/a&gt; from Google Research and Princeton. The idea is straightforward. The LLM produces reasoning steps (thinking aloud about what it needs to do) interleaved with actions (tool calls). It observes the result, then reasons further.&lt;/p&gt;

&lt;p&gt;A ReAct loop looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: The agent observes the current state (the original goal, prior tool results, conversation history).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: The LLM thinks through the problem: "I need to fetch the user's account, check their history, then decide whether to approve the request."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt;: The agent calls a tool, say &lt;code&gt;fetch_account(user_id)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observation&lt;/strong&gt;: The agent receives the result and feeds it back to the LLM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loop&lt;/strong&gt;: The LLM reasons again, decides on the next action, and repeats until it either reaches the goal or determines that the goal isn't achievable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The pattern works because the reasoning traces make the LLM's decisions interpretable. You can see why it chose an action. They also enable self-correction: if a tool result is unexpected, the LLM can reason about what went wrong.&lt;/p&gt;
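&lt;p&gt;The loop above fits in a few dozen lines. A minimal sketch, not any framework's actual API: &lt;code&gt;llm(messages)&lt;/code&gt; is assumed to return either a tool-call dict or a final answer, a hypothetical shape standing in for real tool-calling responses.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def run_agent(llm, tools, goal, max_steps=10):
    """Minimal ReAct-style control loop.

    llm(messages) returns {"tool": name, "args": {...}} to act, or
    {"answer": text} when the goal is reached. tools maps tool names
    to plain Python functions.
    """
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = llm(messages)                # reason: pick the next step
        if "answer" in decision:                # goal reached
            return decision["answer"]
        tool = tools[decision["tool"]]
        observation = tool(**decision["args"])  # act: execute the tool
        messages.append(                        # observe: feed the result back
            {"role": "tool", "content": str(observation)}
        )
    return "stopped: max-iteration limit reached"
&lt;/code&gt;&lt;/pre&gt;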

&lt;p&gt;An agent's core components are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The LLM&lt;/strong&gt; (reasoning engine): Decides what action to take based on the goal and current state. The decision-making layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; (action layer): Functions the agent can call: APIs, database queries, code execution, web searches, file operations. Tools are how the agent affects the world.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context and memory&lt;/strong&gt; (state): Everything the agent knows: the original goal, conversation history, prior tool results, and any persistent state it needs. Without good memory management, agents hallucinate and repeat mistakes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Control loop&lt;/strong&gt; (orchestration): The code that runs the loop. It calls the LLM, parses the output for tool calls, executes them, and feeds results back. Modern frameworks (Anthropic's Claude SDK, LangChain, LlamaIndex) handle this. You can also implement it from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
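&lt;p&gt;To make a tool callable, it is described to the LLM as a name, a description, and a parameter schema. A sketch in the JSON-schema style most tool-calling APIs use (exact field names vary by provider, and &lt;code&gt;fetch_account&lt;/code&gt; is a made-up example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative tool definition; field names vary by provider.
fetch_account_tool = {
    "name": "fetch_account",
    "description": (
        "Fetch a user's account record by user ID. "
        "Returns account fields, or the string 'not found'."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {"type": "string", "description": "The user's ID"},
        },
        "required": ["user_id"],
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A precise description and an explicit "not found" return value are what let the LLM reason about failures instead of retrying blindly.&lt;/p&gt;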

&lt;h2&gt;
  
  
  Levels of autonomy
&lt;/h2&gt;

&lt;p&gt;Agents exist on a spectrum. On one end are suggestion-based copilots that nudge you. On the other are autonomous systems that run unattended for hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Copilot mode&lt;/strong&gt; (suggestion): The agent observes what you're doing and suggests the next action. You approve before it executes. Example: Cursor's autocomplete suggests the next line of code; you hit Tab to accept or Escape to reject. The model is doing some reasoning. You stay in control of execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agentic mode&lt;/strong&gt; (supervised autonomy): The agent makes and executes decisions within a scope you define. You might say "add tests for this file" and the agent writes tests, runs them, and shows you the result, all without asking permission between steps. You can pause or override at any point. Example: Claude Code in an IDE, or an agent working a bounded coding task. The agent is autonomous within the scope, not globally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous agent&lt;/strong&gt; (unattended): The agent pursues a goal with minimal human oversight. You set a goal ("reduce our average response time by 10%") and the agent decides what to measure, what to try, what to roll back, and what to keep. It might run for days, making changes and watching outcomes. Example: an agent managing an experimentation platform, or optimizing an ad-bidding algorithm. These are rare and tend to be domain-specific. The cost of mistakes is too high for general-purpose deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notable tools
&lt;/h2&gt;

&lt;p&gt;Here are some widely used agent runtimes and frameworks, current as of 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Claude Code&lt;/strong&gt; (&lt;a href="https://www.anthropic.com/product/claude-code" rel="noopener noreferrer"&gt;anthropic.com/product/claude-code&lt;/a&gt;): Anthropic's agentic coding tool in the terminal, IDE, and browser. Understands your codebase, executes tasks, and handles git workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; (&lt;a href="https://cursor.com/" rel="noopener noreferrer"&gt;cursor.com&lt;/a&gt;): AI code editor with agent mode. Autonomously explores your codebase, edits files, runs tests, and implements features.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenHands&lt;/strong&gt; (&lt;a href="https://www.openhands.dev/" rel="noopener noreferrer"&gt;openhands.dev&lt;/a&gt;): Open-source autonomous agent for software engineering. Runs in a Docker sandbox, can execute complex tasks end-to-end, and publishes pull requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aider&lt;/strong&gt; (&lt;a href="https://aider.chat/" rel="noopener noreferrer"&gt;aider.chat&lt;/a&gt;): Open-source AI pair programmer for the terminal. Works with your git workflow, supports multiple LLM providers, and commits changes automatically.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Continue&lt;/strong&gt; (&lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;continue.dev&lt;/a&gt;): Open-source IDE extension for VS Code and JetBrains. Offers autocomplete, chat, and agent modes, works with any LLM provider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AutoGPT&lt;/strong&gt; (&lt;a href="https://agpt.co/" rel="noopener noreferrer"&gt;agpt.co&lt;/a&gt;): Open-source autonomous agent framework, released in 2023. Pioneering example of general-purpose agent architecture; known for demonstrating both promise and limitations of autonomous systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How is an agent different from a chatbot?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A chatbot responds. An agent pursues. Ask a chatbot "book me a flight" and it asks clarifying questions, then waits for you to confirm. Ask an agent and it gathers options, checks your calendar, considers your budget, and books, without asking permission between steps. The chatbot reacts. The agent acts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the difference between an agent and a workflow?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A workflow is a fixed sequence of steps determined in advance. You define "do A, then B, then C, with these rules for branching." A workflow always takes the same path for the same inputs. An agent reasons about which steps to take and in what order, adapting based on intermediate results. Workflows are predictable and efficient. Agents trade predictability for flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does my agent keep calling the same tool five times in a row?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a loop, and the LLM probably doesn't recognize what the tool returned as the answer it was looking for. Common causes: the tool returned an error and the agent retried with the same inputs; the response shape was different from what the LLM expected, so it kept trying; the system prompt left the goal vague enough that the LLM thrashes between candidates. Fixes that work: clearer descriptions in your tool schema, explicit error messages from the tool ("not found" rather than null), and a hard call-count budget so the loop terminates rather than burning tokens.&lt;/p&gt;
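&lt;p&gt;The call-count budget is a few lines in the control loop. A sketch (the class and its limits are illustrative, not a feature of any framework):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;class ToolBudget:
    """Hard per-tool call budget so a thrashing agent terminates."""

    def __init__(self, max_calls_per_tool=3):
        self.max_calls = max_calls_per_tool
        self.counts = {}

    def allow(self, tool_name):
        """Count the call; return False once the tool is over budget."""
        self.counts[tool_name] = self.counts.get(tool_name, 0) + 1
        return self.counts[tool_name] &lt;= self.max_calls
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;When &lt;code&gt;allow&lt;/code&gt; returns false, feed the LLM an explicit message ("call budget for this tool exhausted; answer with what you have") rather than silently dropping the call.&lt;/p&gt;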

&lt;p&gt;&lt;strong&gt;How autonomous do agents actually get?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Depends on the task and the risk. In low-risk domains (code suggestions, documentation), agents run nearly unsupervised. In higher-risk domains (financial transactions, customer-facing decisions), agents operate under constraints: bounded scope, human review loops, or escalation to a human when confidence is low. Most production agents are supervised autonomy, not full autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Is it normal for a single Claude Code session to cost $40?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not normal, but not rare either. A long session that maintains a big context and re-reads files often will pile up tokens fast. Three places to look. First, prompt caching: is the run hitting the cache, or rebuilding the prompt every turn? Second, context bloat: huge system prompts, large repos, and many open files multiply per-call cost. Third, model choice: Opus is meaningfully pricier than Sonnet on the same workload. Set a hard spend cap and watch tokens per turn. Most overruns trace to context size, not call count.&lt;/p&gt;
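
&lt;p&gt;The tokens-per-turn math is worth doing by hand once. A sketch with placeholder per-million-token prices (check your provider's current pricing; the cache discount is also an assumption):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Placeholder prices in dollars per million tokens; not current pricing.
PRICES = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "opus": {"input": 15.00, "output": 75.00},
}

def turn_cost(model, input_tokens, output_tokens,
              cached_tokens=0, cache_discount=0.9):
    """Estimate one turn's cost; cached input tokens are discounted."""
    p = PRICES[model]
    uncached = input_tokens - cached_tokens
    dollars = (
        uncached * p["input"]
        + cached_tokens * p["input"] * (1 - cache_discount)
        + output_tokens * p["output"]
    ) / 1_000_000
    return dollars
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Run this over a real session's token counts and the context effect becomes obvious: in context-heavy sessions, input tokens dominate, so shrinking the prompt matters more than making fewer calls.&lt;/p&gt;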

&lt;p&gt;&lt;strong&gt;Why do some agents get stuck or make silly mistakes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents inherit their LLM's limitations. An LLM can hallucinate or misinterpret what a tool returned. Across multiple reasoning steps, these errors compound. A bad tool result leads the agent down the wrong path. Confirmation bias makes it ignore contradictory evidence. Good design mitigates the failure modes: clear tool descriptions, explicit error signals from tools, and a memory model that lets the agent backtrack rather than press on with bad state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://arxiv.org/abs/2210.03629" rel="noopener noreferrer"&gt;ReAct: Synergizing Reasoning and Acting in Language Models&lt;/a&gt;. Yao et al., 2022 (ICLR 2023). The foundational paper introducing the ReAct pattern.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.anthropic.com/research/building-effective-agents" rel="noopener noreferrer"&gt;Building Effective AI Agents&lt;/a&gt;. Anthropic's guide to architecture patterns, tool design, and implementation frameworks for single and multi-agent systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/writing-tools-for-agents" rel="noopener noreferrer"&gt;Writing Effective Tools for AI Agents&lt;/a&gt;. Anthropic's technical advice on tool design for agentic systems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/anthropics/anthropic-cookbook/tree/main/patterns/agents" rel="noopener noreferrer"&gt;Anthropic Cookbook: Patterns and Agents&lt;/a&gt;. Reference implementations and code examples.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;See also: &lt;a href="https://tokenjam.dev/blog/what-is-agent-observability" rel="noopener noreferrer"&gt;What is agent observability?&lt;/a&gt;, &lt;a href="https://tokenjam.dev/blog/what-is-mcp-model-context-protocol" rel="noopener noreferrer"&gt;What is MCP (Model Context Protocol)?&lt;/a&gt;, &lt;a href="https://tokenjam.dev/blog/agent-control-loops" rel="noopener noreferrer"&gt;Agent control loops&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://www.tokenjam.dev/blog/2026-05-08-agents-101?utm_source=devto&amp;amp;utm_medium=referral&amp;amp;utm_campaign=cross-post" rel="noopener noreferrer"&gt;tokenjam.dev/blog&lt;/a&gt;. Part of an ongoing series on the agentic AI ecosystem.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
