DEV Community: Rishabh Poddar

OpenAI Launches ChatGPT Work: An AI Agent for Workplace Automation

Rishabh Poddar — Mon, 13 Jul 2026 05:14:44 +0000

OpenAI's ChatGPT Work is a clear signal of where the market is heading. The company is now focusing on task execution, building tools that do the actual work people need done. According to Reuters, ChatGPT Work can execute tasks across different applications and files. OpenAI's release notes add that it can research information, generate various document types, and ask for approval before taking critical actions. It also supports Scheduled Tasks, allowing the agent to run once, repeat on a schedule, or monitor for changes. This update highlights how the AI race is centering on workplace execution rather than just chat interfaces.

What OpenAI actually launched

OpenAI launched ChatGPT Work on July 9, 2026, alongside GPT-5.6, aiming to give ChatGPT the context and capability to finish tasks rather than just discuss them.

According to Reuters, ChatGPT Work pulls context from connected apps and files to produce various document types. OpenAI's release notes highlight that users can guide the agent's progress in real time while retaining the ability to approve critical actions before they occur.

This approval mechanism is crucial. While most attention focuses on the final output, the real challenge lies in managing boundaries and permissions, ensuring the agent uses the correct files and stays within its designated workspace.

Why this matters

This shift moves enterprise AI from a simple demo to a practical operating model. While AI has long saved time on writing and research, the real change occurs when a tool can independently gather context, execute a workflow, and deliver a finished draft for review.

This capability changes the workday and introduces new risks. When agents act across multiple apps, teams must manage permissions, approvals, auditing, and rollbacks. This makes practical governance strategies, like those discussed in AI Agent Governance Is the New Enterprise Control Plane and Why Your AI Agent Should Never See Your API Keys, essential.

OpenAI's broader enterprise strategy reflects this. Its new consulting arm showed that implementation, not raw model access, is the real bottleneck. ChatGPT Work is the product version of that realization.

The part teams should pay attention to

There are two ways to view ChatGPT Work. The optimistic view is that it simplifies office tasks by letting an agent gather inputs and draft reports in one step. The practical view is that shared AI requires strict governance.

While a single person using ChatGPT Work to draft a memo is straightforward, coordinating ten people across three departments for recurring work introduces complex challenges. Teams must determine who has execution authority, which workflows require manual approval, where credentials are safely stored, how to reuse successful processes, and how to audit past agent actions. Product announcements typically overlook these operational hurdles.

Where teamcopilot.ai fits

teamcopilot.ai is built to make workplace agents usable and safe across an entire team by managing permissions, approvals, reusable skills, scheduled tasks, secrets, and audit trails. The critical factor is not just whether an agent can perform a task, but whether a team can run it safely, repeat it consistently, and audit it later.

As discussed in LiteLLM Agent Platform vs TeamCopilot, the market is quickly splitting into runtime, orchestration, and governance layers. While ChatGPT Work handles the execution layer, teamcopilot.ai operates at the team layer above it, providing the necessary control structure.

What to expect next

OpenAI will likely expand ChatGPT Work to more apps, new surfaces, and automated tasks. While that is the obvious direction, the harder question is whether companies will trust a single assistant with critical tasks without a structured control layer. The next wave of enterprise adoption will not be decided by flashy demos, but by who can make these agents reliable, reviewable, and safe enough for daily team use.

FAQ

What is ChatGPT Work?

ChatGPT Work is OpenAI's workplace-focused agent inside ChatGPT. It is designed for longer tasks involving connected apps, files, and multi-step outputs across various document types.

How is ChatGPT Work different from normal ChatGPT?

Normal ChatGPT answers prompts, whereas ChatGPT Work is designed to carry tasks forward, use connected context, and produce structured work with more autonomy.

Does ChatGPT Work require approval for actions?

Yes, OpenAI says users can approve important actions as the agent works, keeping humans in the loop for higher-risk steps.

Can ChatGPT Work run on a schedule?

Yes, Scheduled Tasks can run once, repeat on a schedule or trigger, or monitor for changes.

Which users got access first?

Reuters reported that rollout began with Pro, Enterprise, and Edu users, with Plus and Business following shortly after.

Why does ChatGPT Work matter for enterprise AI?

It shows the market is moving beyond chatbots and into operational work. The next competition will focus on making agents useful inside actual business workflows rather than just improving model quality.

Is ChatGPT Work enough on its own for a team?

Probably not, as a team usually requires permissions, approvals, logging, shared workflows, and secret isolation to run agents safely.

How does this relate to teamcopilot.ai?

teamcopilot.ai helps teams run AI safely across shared workflows. If ChatGPT Work is the agent executing the task, teamcopilot.ai is the layer that controls access, permissions, and reuse.

What should companies watch before adopting tools like this?

Companies should evaluate access control, auditability, app permissions, secret handling, and whether the agent's workflows can be standardized and reused.

Is this just another productivity feature?

No, it is a sign that AI products are shifting from suggestion engines to task execution engines, representing a major category shift.

What is the biggest risk with workplace agents?

The biggest risk is granting too much access too early. Without controlled workflows, a helpful assistant can quickly become a security or operational hazard.

What is the practical takeaway for teams?

Use workplace agents where they save real time, but implement a proper control layer to ensure repeatable value instead of isolated experiments.

Open Source LLMs: Why Enterprises Are Moving Beyond Frontier Models

Rishabh Poddar — Sun, 12 Jul 2026 05:49:00 +0000

For a while, the default answer to almost every AI problem was simple: use the strongest frontier model you can get.

That made sense early on. Hosted frontier models were better at reasoning, more forgiving with messy prompts, and much easier to plug into a product than anything most teams could run themselves. If you wanted a prototype quickly, they were the obvious choice.

But enterprises do not live in prototype mode for long.

Once AI moves into real workflows, the questions change. How much does it cost at scale? Where does the data go? Can we audit it? Can we control it? Can we make it behave the way the business actually needs? Those are the questions pushing more teams toward open source LLMs, self-hosted deployments, and model tuning on their own data.

Frontier models are still useful, just not for everything

This is not a case for throwing frontier models out. They are still the right tool for some jobs.

If you need the strongest general reasoning, the best shot at a weird one-off task, or a model that can handle broad, ambiguous instructions with very little setup, frontier models are hard to beat. They also make sense when volume is low and the value of a top-tier answer is high.

The catch is that most enterprise work is repetitive, policy-heavy, domain-specific, and sensitive: document processing, internal search, ticket triage, code review, compliance checks, sales follow-up, support routing, and workflow automation. These tasks usually do not need the most expensive model on the market. They need something reliable, cheap enough to run often, and safe enough to touch company data. That is where the frontier-model-everything mindset starts to fall apart.

Why enterprises move away from frontier models for routine work

Cost adds up fast. A frontier model is cheap enough for occasional use, but it turns into a real line item when you call it thousands of times a day across many teams. The more the workflow repeats, the less sense it makes to rent the most expensive intelligence for every step.
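To make the arithmetic concrete, here is a rough break-even sketch. Every number in it is invented for illustration (the per-token price, the tokens per request, and the flat self-hosting bill), and it assumes self-hosting costs the same no matter how many requests you send, which is a simplification.

```python
def monthly_api_cost(requests_per_day, tokens_per_request, price_per_million_tokens, days=30):
    """Pay-per-token bill: grows linearly with request volume."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * price_per_million_tokens


def break_even_requests_per_day(flat_monthly_cost, tokens_per_request, price_per_million_tokens, days=30):
    """Daily volume at which a flat self-hosting bill equals the API bill."""
    cost_per_request = tokens_per_request / 1_000_000 * price_per_million_tokens
    return flat_monthly_cost / (cost_per_request * days)


# All three figures are assumptions for illustration; plug in your own.
PRICE_PER_MILLION = 10.0        # dollars per million tokens
TOKENS_PER_REQUEST = 2_000      # prompt plus completion tokens
SELF_HOSTED_MONTHLY = 3_000.0   # flat infrastructure cost in dollars

for volume in (100, 5_000, 50_000):
    cost = monthly_api_cost(volume, TOKENS_PER_REQUEST, PRICE_PER_MILLION)
    print(f"{volume:>6} requests/day: ${cost:,.2f} per month via API")

break_even = break_even_requests_per_day(SELF_HOSTED_MONTHLY, TOKENS_PER_REQUEST, PRICE_PER_MILLION)
print(f"break-even at about {break_even:,.0f} requests/day")
```

With these made-up inputs, the API bill is about $60 a month at 100 requests a day and matches the flat self-hosting cost at roughly 5,000 requests a day. The exact crossover will differ for every team; the shape of the curve is the point.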

Beyond budget, managing sensitive data presents a major hurdle. Many enterprise workflows involve contracts, customer records, internal strategy, code, legal documents, or regulated data. Sending all of that to a third-party API is a non-starter for a lot of companies, especially when residency and retention rules are strict.

Generic models also tend to miss the local details that make enterprise work hard. They know a lot, but they do not know your schema, your product language, your approval rules, or the difference between a real exception and a normal part of your process. That is why teams often get answers that are technically fine and still not useful.

Finally, relying on a single vendor introduces operational risk. If the price changes, the API changes, the policy changes, or the product roadmap shifts, your workflow can get expensive or brittle overnight. Self-hosted and open source models reduce that dependency.

What open source LLMs change

Open source models give enterprises more control over the full stack, allowing you to run them on your own infrastructure, keep sensitive data inside your environment, and choose the model size that fits the task. Some teams might deploy a smaller model to handle classification or extraction, while others require a larger model tuned for internal knowledge or a narrow business domain.

The real advantage is fit, rather than ownership for its own sake. If you know the use case well, you can choose a model that is good enough instead of paying for a model that is better in theory but wasteful in practice. That usually means faster responses, lower cost, and fewer surprises. There is also a second benefit: you can tune the model to the business instead of asking the business to bend around the model.

Fine-tuning on your own data

For many enterprise use cases, fine-tuning is the first serious step beyond generic prompting, letting a model learn your tone, your labels, your formats, your domain terms, and your preferred decision patterns. It helps when the base model already knows the general idea but not the details you actually care about.

That is useful in a lot of places:

support teams that need consistent response style
legal teams that need clause classification
finance teams that need structured extraction
recruiting teams that need better resume ranking
internal assistants that need company-specific terminology

The point is not to make the model smarter in the abstract, but to make it useful for a task you repeat constantly.

While RAG is great for keeping facts current, fine-tuning is better for maintaining consistent behavior, formatting, or decision style.

When RL-based training helps

While supervised fine-tuning is often enough, some problems call for preference-based training methods such as RLHF and RLAIF. These approaches help when there is no single correct output to imitate and the model instead needs to learn which of several plausible answers is preferred.

This matters for enterprise work because a lot of business output is judged by more than factual correctness. It needs to be concise, compliant, on-brand, and action-oriented. RL-style training can help push a model toward those outcomes when plain prompting keeps drifting.

In practice, the rule is simple: use the lightest method that gets the job done.

Start with prompting. Add retrieval. Then fine-tune if the same behavior keeps showing up across many workflows. Use RL-style methods when you need the model to internalize a preference structure that simple supervised data does not capture well.

The practical enterprise pattern

Most teams do not need to replace frontier models everywhere; instead, they need a split strategy. You can deploy frontier models for complex, rare, or high-stakes tasks that justify the extra cost, while routing frequent, predictable workflows to open-source or self-hosted models. This keeps sensitive data close to home and reserves expensive API calls for the moments that actually require them.

This architecture typically follows a clear pattern:

A smaller self-hosted model handles the bulk of routine work.
A fine-tuned model handles a specific internal workflow.
A frontier model is reserved for complex reasoning or fallback.
Approvals and logging sit around anything that can take action.

That kind of architecture is boring in the best way. It is cheaper, easier to govern, and much easier to explain to security, legal, and finance teams.
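The split described above can be sketched as a small routing function. The tier names, the Task fields, and the rule that sensitive work never goes to an external API are all assumptions made for this example, not a prescribed design.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute whatever your deployment actually runs.
SMALL_SELF_HOSTED = "small-self-hosted"
FINE_TUNED = "fine-tuned-internal"
FRONTIER = "frontier-api"

# Workflows that have their own tuned model (assumed for this example).
FINE_TUNED_KINDS = {"contract_review"}


@dataclass(frozen=True)
class Task:
    kind: str                    # e.g. "ticket_triage", "contract_review"
    sensitive: bool              # touches data that must stay in-house
    needs_deep_reasoning: bool   # rare, ambiguous, or high-stakes


def route(task: Task) -> str:
    """Pick the cheapest tier that fits the task."""
    if task.kind in FINE_TUNED_KINDS:
        return FINE_TUNED
    # Assumption: sensitive data never leaves the environment,
    # so only non-sensitive hard tasks may use the external frontier API.
    if task.needs_deep_reasoning and not task.sensitive:
        return FRONTIER
    return SMALL_SELF_HOSTED


print(route(Task("ticket_triage", sensitive=False, needs_deep_reasoning=False)))
print(route(Task("market_analysis", sensitive=False, needs_deep_reasoning=True)))
```

Approvals and logging are left out here on purpose. They belong around whatever executes the action, not inside the function that picks a model.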

Why this matters for agents

The model is only one part of an agent system.

If an agent can call tools, read documents, or trigger workflows, then the model choice matters less than the surrounding controls. That is why teams are moving from "Which model should we use?" to "Which model should do which job, and under what rules?"

For a deeper look at the control side of this shift, see AI Agent Governance Is the New Enterprise Control Plane and Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails.

If your team is also deciding how much context to keep around the model, What Is Prompt Engineering? A Practical Guide to Context Engineering and KV Cache is a useful companion piece.

Where teamcopilot.ai fits

teamcopilot.ai supports this deployment logic. Teams need reusable workflows, clear approvals, and a way to route different tasks to the right level of intelligence without making every employee reinvent the setup.

That matters most when some workflows can safely use a local or fine-tuned model while others should still fall back to a frontier model. A shared system makes that pattern manageable.

The bottom line

Frontier models are still the ceiling in many cases. They are just no longer the default answer for every enterprise task.

For a growing number of teams, the better answer is a mix of open source LLMs, self-hosted deployment, fine-tuning on proprietary data, and RL-based training when behavior needs to be shaped more carefully. That mix gives enterprises more control, lower cost, and a system that fits the work instead of forcing the work to fit the model.

FAQ

What is a frontier model?

A frontier model is one of the most capable general-purpose LLMs available from a major provider. These models are usually the strongest option for broad reasoning and difficult open-ended tasks.

Why would an enterprise use an open source LLM instead?

Because open source models can be self-hosted, tuned on proprietary data, and used with more control over privacy, cost, and deployment.

Are open source LLMs always cheaper?

Not always. Upfront, you may spend more on infrastructure and setup, but at higher volume self-hosted models can be much cheaper than paying frontier-model API costs for every request.

When should a company still use a frontier model?

Use it for tasks that are rare, complex, or high value enough to justify the cost, or when the best possible answer matters more than control and price.

What kind of tasks fit self-hosted models best?

High-volume tasks, sensitive workflows, classification, extraction, support routing, internal search, and repetitive business processes are usually a good fit.

What is fine-tuning in this context?

Fine-tuning means training a base model further on your own data so it behaves more like your use case needs, whether that means output format, tone, labels, or domain behavior.

Is fine-tuning better than retrieval-augmented generation?

They solve different problems: RAG is better for fresh facts and documents, while fine-tuning is better for consistent behavior, style, and task-specific patterns.

When do RLHF or RLAIF matter?

They matter when you need the model to prefer one style of answer over another, or when simple supervised tuning is not enough to shape the behavior you want.

Do enterprises need to train models from scratch?

Usually not. Most teams should start with an existing open source model and then tune it for their needs.

What is the biggest mistake companies make with enterprise AI?

Using the strongest model for everything. That often creates unnecessary cost, weak governance, and a system that is harder to maintain than it needs to be.

Is self-hosting only for very large companies?

No. It used to be much harder, but modern open source models and deployment tooling make self-hosting realistic for smaller teams too, depending on the workload.

How do you decide between a small model and a frontier model?

Start with the smallest model that can do the job reliably. Move up only if quality, complexity, or failure cost makes it necessary.

What about compliance and data residency?

Those are often the main reasons to self-host. If data cannot leave your environment, open source models become much more attractive.

Can open source models handle enterprise language well?

Yes, especially when they are tuned on your own terminology, internal docs, and workflow examples.

How does teamcopilot.ai help with this shift?

teamcopilot.ai helps teams build reusable workflows with shared controls, so they can choose the right model for each task without losing governance or repeatability.

What Is an Agent Gateway? Why It's Becoming the Control Plane for Enterprise AI

Rishabh Poddar — Wed, 08 Jul 2026 07:11:47 +0000

Agent gateways are having a moment. For once, the hype is pointing at something real.

AI agents do not just answer questions anymore. They call tools, hit APIs, move data, and sometimes take actions in production systems. Once that happens, you need a gateway layer between the agent and everything it touches.

Recent industry moves show how fast the category is developing. Palo Alto Networks acquired Portkey to add AI gateway controls to Prisma AIRS, while Nutanix launched its own Agent Gateway for governance and cost control. Meanwhile, AAIF brought agentgateway under open governance, and Arcade expanded its governed agent runtime to AWS and Azure marketplaces. These different approaches point to the same conclusion: the control layer is no longer optional.

What an agent gateway does

An agent gateway sits between the agent and the tools or models it uses. It can route requests, enforce permissions, limit access, record activity, and make sure the right policy is applied before anything happens.

That may sound like standard infrastructure work. It is. But it matters more for agents because agents are not passive. A chatbot can suggest. An agent can act.

Once you let software take action on your behalf, the old architecture starts to break down. You do not just need a model endpoint. You need a system that identifies the requesting agent, verifies its access permissions, and checks if a human needs to approve the step. It must also log activity for audits and provide a kill switch if something goes wrong. The gateway layer turns raw agent capability into something an enterprise can actually govern.
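As a rough illustration of those checks, here is a minimal in-memory sketch. The Gateway class, its field names, and the approve callback are hypothetical and do not reflect any vendor's API; a real gateway would sit on the network path and persist its log.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Gateway:
    permissions: dict[str, set[str]]       # agent id -> actions it may take
    needs_approval: set[str]               # actions a human must sign off on
    approve: Callable[[str, str], bool]    # (agent_id, action) -> human decision
    kill_switch: bool = False              # flip to True to block everything
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def request(self, agent_id: str, action: str) -> bool:
        """Decide on one action and record the decision, allowed or not."""
        decision = self._decide(agent_id, action)
        self.audit_log.append((agent_id, action, decision))
        return decision == "allowed"

    def _decide(self, agent_id: str, action: str) -> str:
        if self.kill_switch:
            return "blocked: kill switch"
        if agent_id not in self.permissions:
            return "blocked: unknown agent"
        if action not in self.permissions[agent_id]:
            return "blocked: not permitted"
        if action in self.needs_approval and not self.approve(agent_id, action):
            return "blocked: approval denied"
        return "allowed"


gateway = Gateway(
    permissions={"release-bot": {"read_repo", "open_pr"}},
    needs_approval={"open_pr"},
    approve=lambda agent_id, action: False,  # stand-in for a real approval prompt
)
print(gateway.request("release-bot", "read_repo"))
print(gateway.request("release-bot", "open_pr"))
print(gateway.audit_log)
```

Every request is logged whether or not it succeeds, because blocked attempts are often the most useful entries when someone audits the system later.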

Why this is showing up now

This shift is happening because the market has moved past toy demos. Teams are connecting agents to GitHub, Slack, Stripe, internal APIs, databases, and deployment systems, which is where the real risk lives.

While drafting text carries minimal risk, allowing an agent to open pull requests or trigger workflows introduces real danger.

The Forbes article points to several vendors moving in this direction, including Nutanix, Arcade, Manufact, and open projects like agentgateway. The details differ, but the direction is the same. Everyone is trying to answer a single question: how do you keep agents useful without letting them roam freely?

There is also a clear split in how the market is answering that question. While some vendors aim to own the entire control plane within a larger security platform, others prioritize an open, portable gateway so teams can carry it across models, clouds, and runtimes. That tension is going to shape the category for a while.

We covered the same question in our earlier post, AI Agent Governance Is the New Enterprise Control Plane, because a control plane is what this category is becoming. A gateway that actively enforces policy matters more than one that simply routes data.

Why enterprises care

Enterprises do not buy AI because it is clever. They buy it to save time, reduce manual work, and fit into existing operations. But that only works if the system is predictable enough to trust.

Agent gateways help by reducing shadow access and making it easier to separate read and write actions. This gives security teams a central place to enforce policy, while finance teams gain visibility into token spend.

A lot of agent cost problems are really control problems. If every workflow can call the most expensive model all the time, the bill grows fast. A gateway can route work to cheaper models, block wasteful calls, or stop runaway loops before they eat budget.

The same idea appears in MCP vs Skills: Why Skills Save Context Tokens. The goal is simple: each layer should be smaller, cheaper, and easier to control than the one before it.
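The runaway-loop protection mentioned in this section could look something like the sketch below: a per-run budget that the gateway charges before each model call. RunBudget, BudgetExceeded, and the specific limits are invented for the example.

```python
class BudgetExceeded(Exception):
    """Raised when a run would go past its token or call limit."""


class RunBudget:
    def __init__(self, max_tokens: int, max_calls: int):
        self.max_tokens = max_tokens
        self.max_calls = max_calls
        self.tokens_used = 0
        self.calls_made = 0

    def charge(self, tokens: int) -> None:
        """Record one model call, or raise before either cap is crossed."""
        if self.calls_made + 1 > self.max_calls:
            raise BudgetExceeded("call limit reached")
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded("token limit reached")
        self.calls_made += 1
        self.tokens_used += tokens


budget = RunBudget(max_tokens=1_000, max_calls=3)
try:
    while True:              # simulates an agent stuck in a loop
        budget.charge(400)
except BudgetExceeded as exc:
    print(f"run stopped: {exc} after {budget.calls_made} calls")
```

The check happens before the call is made, so a run that hits its cap stops without spending the tokens that would have pushed it over.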

The market is still messy

The category is still being defined. Vendors approach the problem from various angles, including infrastructure, authorization, security, and open platform layers. This variety makes sense, because the same buyer can be looking at the problem from several angles at once.

The open-source angle is especially interesting. A neutral gateway layer could become the place where teams standardize how agents connect to tools, regardless of model provider. That would be a big deal for organizations that do not want their whole operating model tied to one vendor.

At the same time, the market is full of overlap. The line between agent gateway, agent harness, identity layer, and control plane is still fuzzy. That is not a bug. It is what early categories look like before the naming settles.

For teams comparing options, our post on Best AI Agent Platforms for Teams in 2026: Comparing 13 Tools is a good companion read, because the gateway is only one piece of the stack.

What teams should do now

If you are building or buying agent systems, do not start with the biggest model or the longest prompt. Start with control.

Ask these questions first:

What can the agent access by default?
Which actions require approval?
Where are secrets stored?
Can we log every action and decision?
Can we revoke access quickly if something breaks?
Can different teams use the same system without stepping on each other?

Those questions sound basic, but they are where the failures happen.

This is also where Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails becomes practical instead of theoretical. A gateway without approvals is just a fancier connector. A gateway with approvals and audit trails becomes part of an actual operating model.

And if your agent can ever see raw credentials, read Why Your AI Agent Should Never See Your API Keys before you go any further.

If you are designing the system from scratch, a shared workflow layer like teamcopilot.ai can save you from rebuilding the same guardrails every time. You get approvals, permissions, and reusable automation in one place instead of stitching them together per project.

Where teamcopilot.ai fits

teamcopilot.ai is built specifically to address these control and workflow challenges.

Instead of just running agents, teamcopilot.ai lets teams control who runs them, what they can touch, and when a human needs to step in. That is the difference between a demo and a workflow that belongs in production.

If you want a shared team system where workflows, approvals, permissions, and secrets are part of the design from the start, teamcopilot.ai gives you that layer without forcing every team to invent it from scratch.

That matters because most teams fail during the handoff between a smart model and a real action. teamcopilot.ai is useful there because it makes the handoff explicit.

The bigger point

Agent gateways are interesting because they mark a shift in what people think AI infrastructure is.

At first, the race was about model quality. Then it was about context. Then it was about tools. Now it is about control.

This shift is healthy. Enterprises do not need agents that can do everything. They need agents that do the right things, in the right way, under the right rules. That is exactly where the gateway comes in.

FAQ

What is an agent gateway?

It is the control layer between an agent and the tools or models it uses. In practice, it handles routing, permissions, logging, policy, and access control.

Why do enterprises need one?

Because agents can act, not just suggest. Once they can call tools or change systems, enterprises need guardrails around access, approvals, and auditing.

What problem is the market actually solving?

Teams want agents that can work across tools without turning every integration into a security exception. The gateway gives them one place to enforce policy instead of spreading rules across every app.

Is an agent gateway the same as an agent harness?

No. An agent harness provides the broader runtime environment, while a gateway is the specific control point governing what the agent can reach and how traffic flows.

Is this just another name for API management?

Not quite. Traditional API management focuses on service traffic, while agent gateways must also handle model routing, prompt context, tool calls, human approvals, token costs, and behavior that changes mid-run.

How does an agent gateway help with security?

It can keep raw credentials away from the model, limit tool access, enforce policy, and record what happened for later review.

How does it help with cost?

It can route work to the right model, stop repeated or wasteful calls, and make token usage visible enough that teams can actually manage it.

Does an agent gateway reduce cost?

Yes, if it is used well. Good gateways make it easier to route work to cheaper models, stop wasteful calls, and keep runaway workflows from burning tokens.

Open source or vendor platform, which is better?

It depends on the team. Teams prioritizing portability and neutrality often favor open-source options, whereas those wanting a unified, out-of-the-box governance and security suite tend to choose vendor platforms. Most enterprises will end up using a mix.

How is teamcopilot.ai different?

teamcopilot.ai focuses on the workflow layer itself: approvals, permissions, shared automation, and secrets management. It gives teams a way to run agents with guardrails instead of improvising them.

When should a team care about this topic?

As soon as an agent can do more than draft text. If it can touch data, tools, or production systems, the gateway question becomes real.

What is the biggest risk without a gateway?

Unbounded access. The agent may look helpful right up until it reaches something sensitive, expensive, or hard to undo.

What is the simplest first step?

Start by separating read-only workflows from workflows that can take action. That one split makes the rest of the design much clearer.

What should I read next?

If you want the broader context, start with AI Agent Governance Is the New Enterprise Control Plane and Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails. If you are thinking about the protocol layer, MCP vs Skills: Why Skills Save Context Tokens is the best next stop.

If you want the practical safety side, Why Your AI Agent Should Never See Your API Keys is still one of the most important reads in the series.

What Is an Agent Harness? The Missing Layer Between a Model and a Working AI Agent

Rishabh Poddar — Sun, 05 Jul 2026 04:51:50 +0000

People keep using the word "harness" because it points to the part of the system that actually makes an AI agent useful.

The model does the reasoning. The harness gives it a place to run, tools to call, memory to use, and rules to follow. Strip the harness away and you usually do not have an agent anymore. You have a model that can talk.

An agent is simply a model combined with a harness:

Agent = Model + Harness

The short version

An agent harness is the software layer around a model that turns it into something that can actually do work.

It decides:

what context the model sees
which tools it can use
where code runs
what gets stored between steps
when to ask for approval
how to verify whether work is actually done

That is why harness engineering matters. The model is only one part of the job. The rest is the runtime around it, and most of the headaches live there.

If you want the loop itself, our post on What Is an Agent Loop? How AI Agents Reason, Act, and Iterate is the right companion read. The loop is the motion. The harness is the thing that keeps the motion useful.

Why the term matters now

For a long time, "AI" mostly meant chat. You asked a question, got an answer, and moved on.

Agent systems changed that. Now the model can read files, call APIs, run code, update tickets, and keep going across multiple steps. Once that happens, the old mental model stops working. The real question is no longer "what did the model say?" It is "what did the full system do?"

Harnesses make action possible without turning everything into chaos.

Anthropic's work on long-running agents makes this plain. The argument is not that models fall short, but that long tasks need structure, clean state, and a way to continue across context windows. LangChain says something similar in its own harness write-up: the harness is the code, configuration, and execution logic that is not the model itself.

The industry is converging on the same idea from different angles. The model reasons. The harness makes the work real.

What lives inside a harness

A good harness is not one thing. It is a stack of small parts that work together.

Managing context

The harness decides what the model sees. That sounds minor until you watch an agent fail because it was given too much irrelevant context or not enough of the right context.

This is one reason posts like MCP vs Skills: Why Skills Save Context Tokens matter. If you keep stuffing every possible tool and instruction into every session, the agent spends energy just figuring out what is relevant.
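One simple way a harness might keep context small is to rank candidate snippets and stop adding them once a budget is spent. The sketch below assumes relevance scores already exist and uses a word count as a crude stand-in for a real tokenizer; both are simplifications.

```python
def select_context(snippets: list[tuple[str, float]], budget_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring snippets that still fit the budget.

    snippets holds (text, relevance_score) pairs; the scores are assumed
    to come from some earlier retrieval or ranking step.
    """
    chosen: list[str] = []
    used = 0
    for text, _score in sorted(snippets, key=lambda pair: pair[1], reverse=True):
        cost = len(text.split())  # word count as a rough token estimate
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen


candidates = [
    ("deploy checklist for the billing service", 0.9),
    ("full changelog of every release since the project started", 0.4),
    ("on-call contact list", 0.7),
]
print(select_context(candidates, budget_tokens=10))
```

A snippet that does not fit is skipped rather than truncated, so the model never sees half an instruction.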

How tools actually run

The model can suggest an action, but the harness actually runs it. This distinction matters: if the model says "run tests" or "query the database," the harness turns that suggestion into a real command, API call, or sandboxed action.
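Here is a minimal sketch of that handoff. The ToolRunner class and the shape of the suggestion dictionary are assumptions for illustration; real harnesses use their own tool-call formats and usually run tools in a sandbox.

```python
from typing import Callable


class ToolRunner:
    """Maps tool names suggested by the model to real functions."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._tools[name] = fn

    def run(self, suggestion: dict) -> str:
        """Execute one suggested call and return text the model can read."""
        name = suggestion.get("tool")
        if name not in self._tools:
            return f"error: unknown tool {name!r}"
        try:
            return self._tools[name](**suggestion.get("args", {}))
        except Exception as exc:  # report failures instead of crashing the run
            return f"error: {exc}"


runner = ToolRunner()
runner.register("count_words", lambda text: str(len(text.split())))

# The model only produces this dictionary; the harness does the actual work.
print(runner.run({"tool": "count_words", "args": {"text": "run the test suite"}}))
print(runner.run({"tool": "drop_database"}))
```

Anything the model asks for that is not registered simply does not run, which is the whole point of keeping execution on the harness side.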

Memory and state

Real work does not fit inside one prompt. Agents need a way to remember what happened earlier, what failed, what still needs attention, and what should carry over to the next step. Without state, every new turn feels like starting over.
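A small sketch of what that carried-over state might contain. The RunState fields are hypothetical, and serializing to a JSON string stands in for whatever store a real harness would use.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunState:
    goal: str
    completed: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the state so a later turn or process can resume from it."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "RunState":
        return cls(**json.loads(text))


state = RunState(goal="publish weekly report", pending=["gather data", "draft", "review"])
state.completed.append(state.pending.pop(0))   # first step finished

checkpoint = state.to_json()                   # saved between steps
resumed = RunState.from_json(checkpoint)       # next turn picks up here
print(resumed.pending)
```

The resumed run knows what is already done and what failed, so it does not repeat finished steps or retry blindly.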

Guardrails and approvals

This is where the harness stops being abstract. If an agent can send money, delete files, change permissions, or touch production, you want approval gates around those steps. Our post on Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails goes deeper on this, but the basic point is simple: autonomy without boundaries is just risk with a nicer interface.

Verifying the output

Agents make mistakes. A harness should check the work.

That can mean tests, lint checks, policy checks, or simple validation rules. In practice, verification is what stops a confident wrong answer from becoming a shipped mistake.
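The simple validation rules mentioned above can be as plain as a list of check functions. The two checks below are invented examples; a real harness would plug in tests, linters, or policy checks in the same slot.

```python
from typing import Callable, Optional

# A check returns an error message, or None when the output passes.
Check = Callable[[str], Optional[str]]


def not_empty(output: str) -> Optional[str]:
    return None if output.strip() else "output is empty"


def no_placeholders(output: str) -> Optional[str]:
    return "output still contains TODO" if "TODO" in output else None


def verify(output: str, checks: list[Check]) -> list[str]:
    """Run every check and collect the failures; an empty list means done."""
    return [message for check in checks if (message := check(output)) is not None]


problems = verify("Summary: TODO fill in numbers", [not_empty, no_placeholders])
if problems:
    print("send back to the agent:", problems)
```

Returning the full list of failures, rather than stopping at the first one, gives the agent everything it needs to fix in a single retry.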

Observability

If you cannot see what the agent did, you cannot fix it, trust it, or explain it later.

Logs, traces, and audit trails turn agent behavior into something a team can inspect. That is especially important once multiple people depend on the same agent.

Harness vs framework vs agent

A model is the thing that reasons and generates output.
A framework gives you building blocks for making an agent.
A harness is the actual runtime behavior around the model.
An agent is the finished system.

You can think of a framework as the parts list and a harness as the working machine.

That is why the same model can feel very different in different products. A weak harness drags down even a strong model, while a good one makes a smaller model surprisingly capable.

Why harness quality matters more than people think

Most agent failures are not dramatic. They are boring.

The agent loads too much context. It gets stuck in a loop. It uses the wrong tool. It stops too early. It takes one risky action too freely. None of that sounds glamorous, but that is where real systems break.

The best way I have heard it put is this: the model is the brain, and the harness is everything that lets the brain do something useful in the real world.

That is also why posts like Coding Agent Best Practices: How to Set Up AI Agents Securely and Productively stay relevant. A lot of the value is in the setup around the model, not the prompt alone.

Where teamcopilot.ai fits

To make an AI agent do useful work for a team, you need more than a chat box. You need permissions, approvals, secret handling, and a clean record of what happened, so the model can act without running wild.

That is what a team harness looks like in practice. The goal is to make the model safe and structured enough that a team can actually rely on it.

What to look for in a good agent harness

If you are evaluating an agent system, look for these signs:

it keeps context small and relevant
it scopes tool access tightly
it can persist useful state between steps
it asks for approval on risky actions
it can prove what happened after the fact
it checks work before declaring success
it fails in a way a human can recover from

If those pieces are missing, the system may still look impressive in a demo. It will just be fragile in production.

The simple test

You can easily tell if someone understands agent harnesses by asking them what happens after the model says, "I am done."

If they focus only on the answer, they are still thinking about chat. But if they start talking about validation, approvals, logs, state, and the next run, they understand the harness.

That is the real shift.

FAQ

What is an agent harness in simple terms?

An agent harness is the software around a model that lets it act on the world by handling context, tools, memory, safety checks, and execution.

Is an agent harness the same as an agent framework?

Not exactly. A framework gives you libraries and patterns for building agents. The harness is the actual runtime behavior that makes the agent work in practice.

Why do AI agents need a harness at all?

Models cannot run tools, keep durable state, or enforce permissions by themselves, so the harness fills that gap.

What are the most important parts of a harness?

Context management, tool execution, memory, guardrails, verification, and observability are the core pieces.

Does a better model make the harness less important?

Not really. Better models help, but they do not remove the need for control, state, approvals, and recovery.

What goes wrong when the harness is weak?

Agents get noisy, slow, unsafe, or inconsistent. They may use too much context, pick bad tools, or take risky actions without enough checks.

How is an agent harness different from a chatbot UI?

A chatbot UI mostly handles conversation. A harness handles execution. That is the difference between talking about work and actually doing it.

Where does teamcopilot.ai fit into this?

teamcopilot.ai is the kind of system that puts a harness around team workflows, so agents can work with permissions, approvals, secret handling, and audit trails.

Do I need a harness if I only use agents for small tasks?

Even small tasks benefit from basic guardrails and validation. You may not need a heavy setup, but you still need a runtime that keeps the agent honest.

What should I read next?

Start with What Is an Agent Loop? How AI Agents Reason, Act, and Iterate, then read Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails and Why Your AI Agent Should Never See Your API Keys.

The model provides the reasoning, but the harness is what actually gets the work done.

How LLM Caching Works: Prompt Caching, KV Cache, and Semantic Caching for Developers

Rishabh Poddar — Fri, 03 Jul 2026 08:25:04 +0000

Every time you call an LLM API, you pay for every input token processed, even if 90% of the prompt is the same system prompt you sent in the last thousand requests. At scale, that's a predictable waste. Caching fixes it.

This post is a ground-up technical explanation of how LLM caching works: from the KV cache baked into transformer architecture, to the prompt caching APIs exposed by providers like Anthropic and OpenAI, to semantic caching that lets you reuse responses for questions that mean the same thing even if the wording differs. By the end you'll know how to structure your prompts, configure the APIs, measure the savings, and avoid the traps that silently kill your cache hit rate.

Why LLM Caching Matters

At a high level, two things make LLMs expensive to call:

Cost: Input tokens usually cost significantly less than output tokens, but large context windows mean input costs still dominate agentic and RAG workloads. A 10,000-token system prompt sent 1,000 times per day adds up fast.

Latency: LLMs generate output serially, one token at a time. But before they can generate the first output token, they have to process all the input tokens. For a large context, that prefill phase alone adds hundreds of milliseconds.

Caching attacks both problems. When the same input tokens (or semantically equivalent questions) have been seen before, the model can skip recomputation and go straight to generation.

The Three Layers of LLM Caching

Layer	Where it lives	What it caches	Managed by
KV Cache	GPU memory, inside the inference engine	Computed attention keys and values	Model provider
Prompt Cache / Prefix Cache	Provider servers	Pre-processed token prefixes	Provider (opt-in via API)
Semantic Cache	Your infrastructure	LLM responses keyed by embedding similarity	You (Redis, Qdrant, etc.)

These layers are complementary, not competing. Let's explore each one.

The KV Cache: What Transformers Cache Under the Hood

To understand LLM caching, you have to understand a little about how transformer attention works, because the KV cache is built directly into the attention mechanism.

How Transformer Attention Works

In a transformer, every token in the sequence "attends" to every previous token. This is the attention mechanism. Concretely, for each token the model computes three vectors:

Query (Q): What this token is looking for
Key (K): What this token offers as a signal
Value (V): What this token contributes to the output

Attention is: softmax(Q · Kᵀ / √d) · V, where d is the dimension of each key vector.

For autoregressive generation (generating one token at a time), the model generates token n+1 by:

Computing Q, K, V for the new token
Attending over K and V for all previous tokens
Outputting the next token

The expensive part: recomputing K and V for all previous tokens at every step would be quadratic in the sequence length. Inference engines avoid this by caching K and V matrices for all tokens already processed. This is the KV cache.
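
To make the idea concrete, here is a toy single-head sketch in pure Python. It is not how any real inference engine is written: the project function is a stand-in for learned projection matrices, and all names are made up for illustration. It only shows that decoding with cached K and V gives the same outputs as recomputing them at every step.

```python
import math
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """softmax(q · Kᵀ / √d) · V for a single query vector."""
    d = len(q)
    scores = [dot(q, k) / math.sqrt(d) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]

def project(x: Vector, scale: float) -> Vector:
    # Toy stand-in for a learned projection matrix (W_q, W_k, W_v).
    return [scale * xi for xi in x]

def decode_without_cache(tokens: List[Vector]) -> List[Vector]:
    outputs = []
    for t in range(len(tokens)):
        # Recompute K and V for every earlier token at every step.
        keys = [project(x, 0.5) for x in tokens[: t + 1]]
        values = [project(x, 2.0) for x in tokens[: t + 1]]
        outputs.append(attend(project(tokens[t], 1.0), keys, values))
    return outputs

def decode_with_cache(tokens: List[Vector]) -> List[Vector]:
    k_cache: List[Vector] = []
    v_cache: List[Vector] = []
    outputs = []
    for x in tokens:
        # Only the new token's K and V are computed; the rest come from cache.
        k_cache.append(project(x, 0.5))
        v_cache.append(project(x, 2.0))
        outputs.append(attend(project(x, 1.0), k_cache, v_cache))
    return outputs
```

The uncached version performs 1 + 2 + ... + n projections for n tokens, while the cached version performs n, which is where the saving comes from.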

KV Cache in Practice

The KV cache is automatic and invisible; it's an implementation detail of the inference engine (vLLM, TensorRT-LLM, SGLang, etc.). You don't opt into it. Every production LLM deployment uses it.

What it does:

During output generation: For each new token generated, the K and V for all previously generated tokens are already in cache. Only the new token needs fresh K, V computation.
Capacity constraint: KV cache requires GPU memory, roughly proportional to batch size × sequence length × number of layers × head dimensions. At large context lengths or high concurrency, KV cache fills up and the inference engine must evict entries.

What it doesn't do:

It doesn't persist between separate API requests. Each new API call starts with an empty KV cache for the input tokens (unless the provider implements prefix caching on top, which is the next layer).
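
The capacity constraint can be estimated with simple arithmetic. The sketch below assumes one key and one value vector per token, per layer, per KV head; the model shape in the example is a hypothetical 32-layer configuration, not a statement about any particular provider's deployment.

```python
def kv_cache_bytes(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for fp16/bf16
) -> int:
    """Approximate KV cache size: keys and values for every layer and token."""
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value

# Hypothetical model: 32 layers, 32 KV heads of dimension 128, stored in fp16.
per_token = kv_cache_bytes(1, 1, 32, 32, 128)
full_context = kv_cache_bytes(1, 4096, 32, 32, 128)
print(per_token / 2**20, "MiB per token")                       # 0.5
print(full_context / 2**30, "GiB for one 4,096-token sequence")  # 2.0
```

Multiply by the number of concurrent sequences and it becomes clear why engines have to evict entries under load.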

Prompt Caching: Persisting the KV Cache Across Requests

Prompt caching (also called prefix caching or inference caching) is what happens when an LLM provider takes the KV cache idea and makes it persistent across separate API calls.

If request A and request B share the same 9,000-token system prompt, there's no reason to recompute the K and V matrices for those 9,000 tokens on request B. The provider can store the computed KV cache from request A and reuse it.

The Prefix Match Invariant

Here's the most important rule for prompt caching: caching is based on exact prefix match, byte for byte.

The provider compares the incoming prompt against cached prompts at the token level. The cache only applies to the longest matching prefix. The moment a single token differs, the cache ends there, and everything after that point must be recomputed.

This means:

[system: "You are a helpful assistant."][user: "What is 2+2?"] shares a cache with [system: "You are a helpful assistant."][user: "What is 3+3?"], so the system prompt prefix is cached and only the differing user message is recomputed.
[system: "You are a helpful assistant."] and [system: "You are a helpful assistant. Reply in French."] still share the cached prefix up through "You are a helpful assistant.". Only the appended text onward is new. Editing the end of a prompt does not throw away the cache for everything before the edit. Likewise, "You are a helpful assistant (v2)." keeps the cache up to "You are a helpful assistant" and only diverges at the " (v2)" tokens.
What does cause a near-total miss is changing something at the very start. [system: "[build 7f3a] You are a helpful assistant."] differs on its first tokens, so nothing after them matches and the whole prefix is recomputed. This is why volatile values / variables in your system prompt belong at the end of the stable region, never the front.

A byte change invalidates the cache from that point forward, but everything before the change stays cached. This is the single most important thing to internalize about prompt caching.

NOTE: In practice, providers only cache prefixes above a minimum length and round the cached region down to a fixed block boundary, so very small matches like the short prompts above aren't independently cacheable.
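
The matching rule, including the minimum length and block rounding from the note, can be sketched over lists of token IDs. The default values for min_tokens and block_size are assumptions chosen for illustration; real providers set their own.

```python
from typing import Sequence

def cached_prefix_length(
    cached: Sequence[int],
    incoming: Sequence[int],
    min_tokens: int = 1024,  # assumed minimum cacheable length
    block_size: int = 128,   # assumed block granularity
) -> int:
    """Number of leading tokens of `incoming` that could be served from cache."""
    match = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        match += 1
    # Round down to a block boundary, then enforce the minimum length.
    usable = (match // block_size) * block_size
    return usable if usable >= min_tokens else 0

system = list(range(2000))                    # stand-in for a 2,000-token system prompt
first = system + [9001, 9002]                 # request A
second = system + [9003]                      # request B, different user message
print(cached_prefix_length(first, second))    # 1920: 2,000 matched, rounded down to 15 blocks
print(cached_prefix_length(first, [-1] + system[1:]))  # 0: first token differs
```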

Provider APIs: How to Enable Prompt Caching

Anthropic (Claude)

Claude offers two modes for enabling prompt caching — automatic and explicit — both using cache_control, but placed differently.

Automatic caching (recommended for multi-turn conversations): place cache_control at the top level of the request. The API automatically moves the cache breakpoint to the last cacheable block on every request, so as your conversation history grows, old turns are read from cache and only new turns are written. You don't manage breakpoint placement yourself.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},  # top-level = automatic mode
    system="You are an expert software engineer...",
    messages=[
        {"role": "user", "content": "What is a linked list?"},
        {"role": "assistant", "content": "A linked list is..."},
        {"role": "user", "content": user_question},  # new turn, rest read from cache
    ]
)

Explicit caching (recommended for static content like RAG documents or system prompts): place cache_control on individual content blocks. The provider caches everything from the start of the prompt up through the marked block. You can place up to 4 breakpoints per request to create tiered cache boundaries.

response = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert software engineer...\n\n" + large_codebase_context,
            "cache_control": {"type": "ephemeral"},        # 5-minute TTL (default)
            # "cache_control": {"type": "ephemeral", "ttl": "1h"},  # 1-hour TTL
        }
    ],
    messages=[
        {"role": "user", "content": user_question}  # dynamic, not cached
    ]
)

# Check cache status in response
print(response.usage.cache_read_input_tokens)     # tokens served from cache
print(response.usage.cache_creation_input_tokens) # tokens written to cache
print(response.usage.input_tokens)                # tokens computed normally

cache_control has two jobs:

Set the cache boundary — marks where the cached prefix ends (explicit mode) or moves automatically to the last block (automatic mode).
Set the TTL — {"type": "ephemeral"} = 5-minute cache; {"type": "ephemeral", "ttl": "1h"} = 1-hour cache.

Effect on pricing:

5-minute TTL ({"type": "ephemeral"}): 1.25× base input price to write, 0.1× base price to read
1-hour TTL ({"type": "ephemeral", "ttl": "1h"}): 2× base input price to write, 0.1× base price to read

The write premium only applies on cache creation. On a cache hit you pay 10% of the normal token price. Use the 1-hour TTL when your conversation sessions routinely outlast 5 minutes; pay the higher write cost once and read cheaply for up to an hour.
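
A quick way to see when the write premium pays off is to express cost in units of "one uncached pass over the prefix". The sketch below uses the multipliers quoted above and assumes every request after the first lands inside the TTL window; the function names are illustrative.

```python
def cost_in_base_units(num_requests: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cost of sending one prefix num_requests times: first call writes, the rest read."""
    if num_requests <= 0:
        return 0.0
    return write_multiplier + read_multiplier * (num_requests - 1)

def break_even_requests(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest request count at which caching is cheaper than not caching."""
    if read_multiplier >= 1:
        raise ValueError("caching never pays off if reads cost as much as normal input")
    n = 1
    while cost_in_base_units(n, write_multiplier, read_multiplier) >= n:
        n += 1
    return n

print(break_even_requests(1.25))  # 2: the 5-minute TTL pays off on the second request
print(break_even_requests(2.0))   # 3: the 1-hour TTL needs a third request
```

Under these assumptions the break-even point is very low, so the real question is usually whether a second or third request arrives before the TTL lapses.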

Minimum cacheable prefix by model (prefixes shorter than the model's minimum are not cached):

Claude Opus 4.8 / Sonnet: 1,024 tokens
Claude Haiku 4.5: 4,096 tokens
Claude Fable 5: 512 tokens

OpenAI

OpenAI's prompt caching is automatic — no cache_control markers needed. Any prompt of 1,024+ tokens is eligible and the cache is checked on every request. You see cache usage in the usage object:

import openai

client = openai.OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": large_system_prompt},
        {"role": "user", "content": user_question},
    ],
    # Optional: control cache retention policy
    # prompt_cache_retention="in_memory",  # 5-10 min inactivity window, max 1h (default)
    # prompt_cache_retention="24h",        # up to 24h (model-dependent)
)

# Check cache usage
cached_tokens = response.usage.prompt_tokens_details.cached_tokens
print(f"Cached tokens: {cached_tokens}")
print(f"Total prompt tokens: {response.usage.prompt_tokens}")

Two retention policies are available via prompt_cache_retention:

"in_memory" (default): cache evicted after 5–10 minutes of inactivity, max 1 hour
"24h": cache persists up to 24 hours (available on supported models)

You can also pass prompt_cache_key to influence request routing and improve hit rates when many requests share the same prefix across different sessions.

OpenAI's cached tokens are priced at 50% of the normal input token price.

AWS Bedrock

AWS Bedrock uses explicit cachePoint markers (Converse API) or cache_control blocks (InvokeModel API for Claude). TTL is configurable per checkpoint — omitting it defaults to 5 minutes.

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-opus-4-5-20251101-v1:0",
    system=[
        {"text": large_system_prompt},
        {
            "cachePoint": {
                "type": "default",
                # "ttl": "1h",  # optional; defaults to "5m"; 1h only on supported models
            }
        },
    ],
    messages=[
        {"role": "user", "content": [{"text": user_question}]}
    ]
)

usage = response["usage"]
print(f"Cache read tokens: {usage.get('cacheReadInputTokens', 0)}")
print(f"Cache write tokens: {usage.get('cacheWriteInputTokens', 0)}")

Note that cachePoint is a separate object in the array (not a field on the text block), unlike Claude's direct API where cache_control sits on the content block itself.

Two TTL tiers, same as the direct Claude API:

"5m" (default): 5-minute cache, 1.25× write cost
"1h": 1-hour cache, 2× write cost — only available on Claude Opus 4.5, Sonnet 4.5, and Haiku 4.5

Minimum tokens per checkpoint also varies by model: 1,024 tokens for Claude Opus 4 / Sonnet 4.6 / Claude 3.x, 4,096 tokens for Claude Opus 4.5 / Sonnet 4.5 / Haiku 4.5. Up to 4 checkpoints per request.

How to Structure Prompts to Maximize Cache Hits

The rule is simple: any variable or dynamic value in your prompt should appear as late as possible. The cache is a prefix match — everything before the first change is reusable, everything after it is recomputed. So if your system prompt starts with f"Current time: {datetime.now()}", that timestamp changes every request and busts the cache for every token that follows it, including the thousands of tokens of static instructions below it.

from datetime import datetime

# BAD: dynamic value at the top poisons everything below it
system_prompt = f"""
Current time: {datetime.now().isoformat()}
User tier: {user_tier}

You are a helpful assistant. [... 5,000 tokens of instructions ...]
"""

# GOOD: stable instructions first, dynamic values at the end of the system prompt
system_prompt = f"""
You are a helpful assistant. [... 5,000 tokens of instructions ...]

User tier: {user_tier}
Current time: {datetime.now().isoformat()}
"""

The same applies to any variable inside the system prompt: put it after all the static instructions, not before them.

Semantic Caching: Caching by Meaning

Prompt caching and KV caching are exact-match: they reuse work only for a prefix that is identical token for token, and stop at the first difference. Semantic caching takes a different approach: it caches LLM responses and uses embedding similarity to match new questions against previously answered ones.

The core idea: "What is the capital of France?" and "Tell me the capital city of France" mean the same thing. Instead of calling the LLM again, semantic caching returns the cached response from the first question.

How Semantic Caching Works

Embed the query: Convert the incoming question into an embedding vector (a dense numerical representation of meaning).
Search the cache: Perform a nearest-neighbor search in a vector store to find the most similar previously-asked question.
Check the threshold: If the cosine similarity between the new query and the cached query exceeds a threshold (e.g., 0.95), it's a cache hit. Return the stored response.
On a miss: Call the LLM, then embed the query-response pair and store it in the vector cache for future hits.
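
The four steps above can be sketched in a few lines. This is a toy, in-memory version under loud assumptions: toy_embed is a bag-of-words counter standing in for a real embedding model (so it will not recognize true paraphrases), the lookup is a linear scan instead of a vector index, and call_llm is whatever function you pass in.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

def toy_embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []  # (query embedding, response)

    def lookup(self, query: str) -> Optional[str]:
        vec = toy_embed(query)
        # Linear scan; a real deployment would use a vector index instead.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def answer(self, query: str, call_llm: Callable[[str], str]) -> str:
        cached = self.lookup(query)
        if cached is not None:
            return cached            # hit: skip the LLM call
        response = call_llm(query)   # miss: call the model, then store
        self.entries.append((toy_embed(query), response))
        return response
```

Swapping toy_embed for a real embedding model and the list for a vector store keeps the same control flow: embed, search, compare against the threshold, and fall back to the LLM on a miss.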

Semantic Caching Tradeoffs

Semantic caching is powerful but has real tradeoffs you should understand before deploying it:

Consideration	Impact
Non-determinism risk	You're returning a response from a previous question that's similar, not identical. If the original question was "What's the EU VAT rate?" and the new one is "What's the UK VAT rate?", a similarity score of 0.93 might incorrectly serve the EU response.
Threshold tuning	Too high (0.99): few cache hits, near-zero benefit. Too low (0.80): high risk of serving the answer to a different question. 0.92–0.97 is usually a safe starting range.
Context sensitivity	Semantic caching works best for factual Q&A and structured tasks. It breaks down for tasks where context matters heavily (e.g., "summarize this document").
Cache invalidation	When facts change (prices, policies, real-world events), stale semantic cache entries return outdated information. Implement TTLs and a way to bust the cache.
Extra latency on miss	A cache miss now costs: embed query + vector search + LLM call. That's slightly more latency than a plain LLM call. The bet is that high hit rates amortize this overhead.
Infrastructure cost	You're now operating a vector store. Not free. At very low query volumes, the infrastructure cost may exceed the LLM savings.

Provider Comparison: OpenAI vs. Claude vs. Bedrock

Here's a side-by-side comparison of prompt caching support across major providers:

Feature	Anthropic Claude	OpenAI	AWS Bedrock
Caching type	Automatic or explicit (cache_control)	Automatic	Explicit (cachePoint)
Minimum prefix	512–4,096 tokens (varies by model)	1,024 tokens	1,024–4,096 tokens (varies by model)
TTL	5 min or 1 hour (opt-in)	5–10 min of inactivity (default), up to 24 h (opt-in)	5 min or 1 hour (supported models)
Cache read price	10% of base input	50% of base input	10% of base input
Cache write price	125–200% of base input	Base input price	125–200% of base input
Cache hit detection	`cache_read_input_tokens`	`cached_tokens` in `prompt_tokens_details`	`cacheReadInputTokens`
Max breakpoints	4 per request	N/A (automatic)	4 per request
Tool caching	Yes (tools count as prefix)	Yes	Partial
Zero-retention orgs	Supported	Supported	Supported

Cost Comparison Example

Scenario: 10,000-token system prompt, 500-token user message, 500-token response, 10,000 requests/day.

Without caching: 10,500 input tokens × price/token × 10,000 requests

With Claude prompt caching (5-min TTL, 100% hit rate after first request):

Cache write cost: 10,000 tokens × 1.25× price (once per 5 minutes = ~288 writes/day)
Cache read cost: 10,000 tokens × 0.1× price × 10,000 requests
Non-cached tokens: 500 tokens × price × 10,000 requests

At Claude Opus 4.8 pricing ($5/1M input tokens):

Without caching: 10,500 × $5/1M × 10,000 = $525/day
With caching: (288 writes × 10,000 × $6.25/1M) + (10,000 × 10,000 × $0.50/1M) + (10,000 × 500 × $5/1M) ≈ $18 + $50 + $25 = $93/day

Savings: ~82% reduction in input token costs.
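
The same arithmetic as a small function, so you can plug in your own volumes. It mirrors the simplifications of the example above (one write every 5 minutes, and every request counted as a cache read); the function name and defaults are illustrative, and the prices are the ones quoted in this post.

```python
def daily_cost_usd(
    prefix_tokens: int = 10_000,
    user_tokens: int = 500,
    requests_per_day: int = 10_000,
    base_price_per_mtok: float = 5.00,  # input price used in the example above
    write_multiplier: float = 1.25,     # 5-minute TTL write premium
    read_multiplier: float = 0.10,      # cache read discount
    writes_per_day: int = 288,          # one write every 5 minutes
) -> dict:
    per_token = base_price_per_mtok / 1_000_000
    uncached = (prefix_tokens + user_tokens) * per_token * requests_per_day
    writes = writes_per_day * prefix_tokens * per_token * write_multiplier
    reads = requests_per_day * prefix_tokens * per_token * read_multiplier
    dynamic = requests_per_day * user_tokens * per_token
    cached = writes + reads + dynamic
    return {"uncached": uncached, "cached": cached, "savings": 1 - cached / uncached}

result = daily_cost_usd()
print(f"${result['uncached']:.0f}/day uncached, ${result['cached']:.0f}/day cached")
```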

Measuring Cache ROI in Production

Knowing your cache hit rate is critical. A cache that's not actually hitting is a false economy (you're paying the write premium for nothing).

Tracking Cache Metrics with Claude

import anthropic
from dataclasses import dataclass

@dataclass
class CacheMetrics:
    total_requests: int = 0
    cache_hits: int = 0
    cache_misses: int = 0
    total_input_tokens: int = 0   # uncached tokens only (usage.input_tokens)
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0

    @property
    def hit_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.cache_hits / self.total_requests

    @property
    def estimated_savings_usd(self) -> float:
        # Claude Opus 4.8: $5/1M input, $0.50/1M cache read, $6.25/1M cache write
        # usage.input_tokens excludes cached tokens, so the uncached baseline is
        # the sum of all three counters priced at the normal input rate.
        all_tokens = self.total_input_tokens + self.cache_read_tokens + self.cache_write_tokens
        base_cost = all_tokens * 5 / 1_000_000
        actual_cost = (
            (self.cache_read_tokens * 0.50 / 1_000_000)
            + (self.cache_write_tokens * 6.25 / 1_000_000)
            + (self.total_input_tokens * 5 / 1_000_000)
        )
        return base_cost - actual_cost

metrics = CacheMetrics()

def tracked_llm_call(system_prompt: str, user_message: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        system=[{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": user_message}],
    )

    usage = response.usage
    metrics.total_requests += 1
    metrics.total_input_tokens += usage.input_tokens
    metrics.cache_read_tokens += usage.cache_read_input_tokens or 0
    metrics.cache_write_tokens += usage.cache_creation_input_tokens or 0

    if usage.cache_read_input_tokens and usage.cache_read_input_tokens > 0:
        metrics.cache_hits += 1
    else:
        metrics.cache_misses += 1

    return response.content[0].text

# After some requests:
def print_metrics():
    print(f"Hit rate: {metrics.hit_rate:.1%}")
    print(f"Cache read tokens: {metrics.cache_read_tokens:,}")
    print(f"Estimated savings: ${metrics.estimated_savings_usd:.2f}")

Key Metrics to Track

Metric	Formula	What it tells you
Cache hit rate	`cache_hits / total_requests`	Primary efficiency signal
Cache coverage	`cache_read_tokens / (input_tokens + cache_read_tokens + cache_write_tokens)`	What % of your tokens come from cache
Cost per request (cached)	`(cache_read + uncached) cost / requests`	Actual cost vs. baseline
TTFT improvement	`p50 latency (hit) / p50 latency (miss)`	Latency benefit from caching
Write amortization	`cache_read_hits / cache_write_events`	Are writes paying for themselves?

When Caching Isn't Worth It

Caching has a write premium: you pay extra on the first request (cache creation). This only pays off if subsequent requests reuse that cache. Caching isn't worth it when:

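Low traffic: If requests sharing the same system prompt arrive less often than the TTL (for example, fewer than 100 evenly spaced requests/day against a 5-minute TTL), most requests pay the write premium and few ever get a cache read.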
Short TTL relative to traffic pattern: If requests come in bursts with long gaps, the cache expires between bursts and every request pays the write premium.
High prompt variability: If every request has a unique prefix (e.g., each user has a fully personalized system prompt), there's nothing stable to cache.
Very small prompts: Below the minimum prefix length (1,024–4,096 tokens), caching doesn't apply at all.

Common Pitfalls and How to Avoid Them

1. Prefix Invalidation Cascades

Changing anything early in the prompt invalidates everything after it. If your system prompt has a version number in it:

# BAD
system_prompt = f"System v{VERSION}. Instructions: ..."

# After a version bump, the old cache entry is useless: requests miss until the new prefix is written

Keep version metadata out of the main prompt text. If you need to track which prompt version produced a response, store it in your application metadata, not in the prompt itself.

2. Tool Order Variability

Tools are processed before the system prompt in Claude's cache ordering. If the list of tools passed to each request varies (different order, different subset), the cache will miss on every request:

# BAD: tool order can differ between calls (e.g. built from a set or an unordered query)
tools = get_enabled_tools_for_user(user_id)  # Returns a different order each time

# GOOD: normalize tool order
tools = sorted(get_enabled_tools_for_user(user_id), key=lambda t: t["name"])
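
One way to catch this kind of silent invalidator is to fingerprint the parts of the request that are supposed to stay stable and log the hash with every call. The sketch below is a local diagnostic only, with a hypothetical prefix_fingerprint helper; it does not reproduce how any provider hashes or tokenizes prompts.

```python
import hashlib
import json
from typing import Any, Dict, List

def prefix_fingerprint(tools: List[Dict[str, Any]], system_prompt: str) -> str:
    """Hash the parts of a request that are supposed to stay byte-identical."""
    canonical = json.dumps(
        {"tools": sorted(tools, key=lambda t: t["name"]), "system": system_prompt},
        sort_keys=True,          # stable key order inside every object
        separators=(",", ":"),   # stable whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

tools_a = [{"name": "search", "description": "Search docs"}, {"name": "calc", "description": "Do math"}]
tools_b = [{"description": "Do math", "name": "calc"}, {"description": "Search docs", "name": "search"}]
print(prefix_fingerprint(tools_a, "You are helpful.") == prefix_fingerprint(tools_b, "You are helpful."))
```

If the fingerprint changes between requests that you expected to share a cache, something in the "stable" region is varying.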

3. Confusing 1-Hour TTL Economics

The 1-hour TTL ({"type": "ephemeral", "ttl": "1h"}) costs 2× base input price on write (vs. 1.25× for 5-minute). This is only worth it if requests are spaced more than 5 minutes apart but still within the hour. For high-frequency workloads (requests every few seconds), stick with the 5-minute TTL, since the cache will stay warm without the extra write cost.

4. Concurrent Request Race Conditions

If many requests arrive simultaneously before the cache is warm, all of them may miss and write to the cache at the same time, paying the creation cost multiple times. The fix is to pre-warm the cache with a single request before opening traffic, or to use a lock at the application layer so that only the first request performs the write.
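
A minimal sketch of the application-layer lock, assuming a single process and a hypothetical warm function that sends one request carrying the shared prefix. It does not track TTL expiry, so a production version would need to re-warm after the cache lapses.

```python
import threading
from typing import Callable

class CacheWarmer:
    """Let exactly one caller pay the cache write; others wait for it to finish."""

    def __init__(self, warm: Callable[[], None]):
        self._warm = warm              # hypothetical function that sends the warm-up request
        self._lock = threading.Lock()
        self._warmed = False

    def ensure_warm(self) -> None:
        if self._warmed:               # fast path once the cache is warm
            return
        with self._lock:
            if not self._warmed:       # re-check inside the lock
                self._warm()
                self._warmed = True
```

Each request handler would call ensure_warm() before sending its own request; only the first caller triggers the write, and concurrent callers block until it completes.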

5. Ignoring the 20-Block Lookback

For very long conversations (50+ turns), Claude's cache matching only looks back 20 content blocks. If you're appending conversation turns as individual content blocks, the stable "system context" blocks that come before the 20-block window won't cache as expected. Keep your system prompt in the system parameter (not in the messages array) to ensure it's always in scope for caching.

Cache-Aware Model Routing: Why "Use the Smaller Model" Is Often Wrong

Most cost-optimization advice for multi-model setups boils down to: "route easy tasks to a cheaper, smaller model." That's usually right, but it ignores caching, and once you account for the prompt cache, the naive version of that rule can actually increase your bill.

Suppose you have a long, stable context (a big system prompt, a codebase, or a RAG payload) already cached on a large model. A new task arrives that a smaller model could handle. The naive router sends it to the small model to "save money." But routing to a different model means:

The small model has a cold cache for that context, so you pay full price to process the entire prefix again (no cache read discount).
On the large model, that same prefix would have been served at ~10% of input price as a cache read.

When the cached context is large, a cache read on the expensive model is frequently cheaper than a full-price prefill on the cheap model. Ten percent of a big number beats one hundred percent of a slightly smaller number. So the "cheaper model" ends up costing more.
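
A back-of-envelope version of that comparison, looking at input cost only. The prices in the example are made-up placeholders, the function names are illustrative, and the sketch ignores output tokens, task quality, and any cache write cost on the smaller model.

```python
def input_cost_usd(tokens: int, price_per_mtok: float, multiplier: float = 1.0) -> float:
    return tokens * price_per_mtok * multiplier / 1_000_000

def cheaper_route(
    cached_prefix_tokens: int,
    new_tokens: int,
    large_price: float,   # $/1M input tokens on the large model (assumed)
    small_price: float,   # $/1M input tokens on the small model (assumed)
    read_multiplier: float = 0.1,
) -> str:
    """Compare input cost only; output tokens and task quality are ignored."""
    stay = (
        input_cost_usd(cached_prefix_tokens, large_price, read_multiplier)
        + input_cost_usd(new_tokens, large_price)
    )
    switch = input_cost_usd(cached_prefix_tokens + new_tokens, small_price)
    return "large" if stay <= switch else "small"

# 100,000 cached tokens: the cache read on the large model wins.
print(cheaper_route(100_000, 500, large_price=5.0, small_price=1.0))  # large
# 1,000 cached tokens: the prefix is too small to matter, the small model wins.
print(cheaper_route(1_000, 500, large_price=5.0, small_price=1.0))    # small
```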

This is exactly the calculation AIAgentCostSaver makes automatically. It does cache-aware routing: it doesn't just look at whether a task is simple enough for a smaller model, it also looks at where the context is currently cached and how large it is. If the context is already warm on a larger model and big enough that the cache-read savings dominate, it keeps the request on the larger model, because that's cheaper overall even though a smaller model could technically do the job.

It only switches down to a smaller model when the math actually favors it (and the task is suitable), which happens in cases like:

New conversation: there's no warm cache to preserve, so no cache-read advantage is lost by starting fresh on a smaller model.
Warm large-model cache, but a very small conversation: the cached prefix is tiny, so the cache-read savings are negligible and the smaller model's lower per-token price wins.
Cache has expired: provider-side KV cache is short-lived (the default TTL is only about 5–10 minutes), so once it lapses there's nothing to preserve, and the next request pays a full prefill regardless of model.

TL;DR: Key Takeaways

KV cache is automatic and built into transformer inference. It speeds up token generation by reusing computed attention matrices for tokens already processed in the same request.

Prompt caching / prefix caching is exposed at the API level: opt-in via markers on Claude and Bedrock, automatic on OpenAI. It persists the KV cache across separate API requests by caching the computed prefix. The savings kick in when many requests share the same large prefix (system prompt, tool definitions, RAG context).

Semantic caching is your own infrastructure layer. It returns cached LLM responses for semantically similar queries using embeddings + vector search. Best for FAQ-style workloads with high query repetition.

To maximize cache hit rate:

Put stable content first: tools → system → few-shot examples → conversation history
Put dynamic content last: user message, timestamps, per-request variables
Use cache_control markers explicitly (Claude) or trust auto-detection (OpenAI)
Avoid silent invalidators: datetime.now(), non-deterministic JSON, varying tool sets
Normalize everything that's supposed to be stable

To measure ROI:

Track cache_read_input_tokens against total input (input_tokens + cache_read_input_tokens + cache_creation_input_tokens) to get your cache coverage
Calculate actual cost vs. uncached baseline using provider pricing
Monitor TTFT (time to first token) split by cache hit/miss to measure latency wins
Alert if hit rate drops below your expected baseline (often a sign of a silent invalidator)

LLM caching isn't free: there's a write premium, infrastructure overhead for semantic caching, and engineering effort to structure prompts correctly. But for any production system making frequent calls with large, stable prompts, the ROI is typically 5–10× in cost reduction and meaningful latency improvement. It's one of the highest-leverage optimizations available to developers building on top of LLMs today.

Frequently Asked Questions

How does LLM caching work?

LLM caching reuses computation or results from earlier requests instead of redoing them. It happens at three layers: the KV cache stores per-token attention keys and values inside the model so generation doesn't recompute the past; prompt (prefix) caching persists that KV cache on the provider's servers so a repeated prompt prefix is skipped across separate API calls; and semantic caching stores full LLM responses in your own vector store and returns them for queries that are semantically similar. The first two save prefill compute and input-token cost; the third can skip the LLM call entirely.

What is the difference between KV cache and prompt caching?

The KV cache lives in GPU memory during a single request and is automatic; you never configure it. Prompt caching (prefix caching) is the provider persisting that KV state between separate API requests so a shared prefix (system prompt, tools, RAG context) isn't recomputed each time. Put simply: KV cache makes one request fast; prompt caching makes repeated requests cheap.

What is prefix caching in LLMs?

Prefix caching is prompt caching by another name. The provider matches the longest common prefix of your prompt against what it has already processed, reuses the cached keys and values for that prefix, and only computes tokens after the first point of difference. It's why stable content belongs at the front of the prompt and volatile content at the end.
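
To make the prefix-matching idea concrete, here is a toy model that counts how many leading tokens two prompts share. It is only an illustration: the token strings are made up, and real providers match on actual tokens, often in blocks and above a minimum length, so this is not any provider's implementation.

```python
def shared_prefix_length(cached: list[str], incoming: list[str]) -> int:
    """Count leading tokens that are identical in both prompts."""
    count = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break  # everything after the first difference must be recomputed
        count += 1
    return count


cached = ["<tools>", "<system>", "<examples>", "user:", "hello"]
stable_first = ["<tools>", "<system>", "<examples>", "user:", "refund?"]
timestamp_first = ["<time=12:01>", "<tools>", "<system>", "<examples>", "user:", "refund?"]

print(shared_prefix_length(cached, stable_first))     # 4
print(shared_prefix_length(cached, timestamp_first))  # 0
```

A single volatile token at the front drops the reusable prefix to zero, even though the rest of the prompt is unchanged.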

What is semantic caching, and when should I use it?

Semantic caching embeds each incoming query, searches a vector store for a previously answered query above a similarity threshold, and returns the stored response on a hit, skipping the LLM. Use it for high-repetition, factual, FAQ-style workloads. Avoid it (or set a very high threshold) when small wording differences change the correct answer, or when responses depend heavily on live, per-request context.
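
The lookup flow described above can be sketched in a few lines. This is a minimal illustration under loud assumptions: the embed function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the SemanticCache class and its 0.8 threshold are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []  # stand-in for a vector store

    def get(self, query: str) -> Optional[str]:
        """Return the stored response for the most similar query, if close enough."""
        vector = embed(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


cache = SemanticCache(threshold=0.8)
cache.put("How do I reset my password?", "Use the 'Forgot password' link.")

print(cache.get("how do i reset my password"))   # hit: returns the stored answer
print(cache.get("How do I delete my account?"))  # miss: returns None
```

The second query shares most of its words with the first but asks for something different, which is exactly the case where a threshold set too low would return the wrong answer.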

How long does a prompt cache last?

Provider-side caches are short-lived. The default TTL is roughly 5 to 10 minutes of inactivity, refreshed each time the prefix is reused. Anthropic also offers an opt-in 1-hour TTL at a higher write price. Because the window is short, caching pays off most when the same prefix is hit frequently.

Does prompt caching change the model's output?

No. Prompt caching reuses the exact keys and values the model would have computed anyway, so a cache hit produces the same result as a cache miss. It only affects cost and latency, not the tokens the model sees or the distribution it samples from. (Semantic caching is different: it returns a previous response for a similar query, so it can change what the user gets.)

Why isn't my prompt cache being hit?

Almost always a "silent invalidator" near the front of the prompt: a datetime.now() timestamp, a per-request UUID, non-deterministic JSON key ordering, or a tool list whose order changes between requests. Any byte change invalidates the cache from that point forward, so a moving value at the top defeats everything after it. Sort JSON keys, pin tool order, and keep volatile data at the end. Confirm hits by reading cache_read_input_tokens (Claude) or cached_tokens (OpenAI) in the response usage.

Do I need to change my code to use prompt caching?

It depends on the provider. OpenAI caches automatically for prompts above its minimum length, so no code changes are required. Anthropic and AWS Bedrock are opt-in: you mark the end of the cacheable prefix with a cache_control (or cachePoint) breakpoint. In all cases the bigger lever is prompt structure, keeping the stable prefix identical across requests.
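
For the opt-in case, a request body with one cache breakpoint might look like the sketch below. It is intended to follow the shape of Anthropic's Messages API, with cache_control set to type ephemeral on the last stable block, but verify the details against the current documentation. The model name is a placeholder, the prompt text is invented, and no network call is made.

```python
# Large, stable instructions shared by every request (placeholder text).
STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the policy below..."


def build_request(user_message: str) -> dict:
    """Build a Messages-style request body with one cache breakpoint."""
    return {
        "model": "MODEL_NAME",  # placeholder: substitute a real model ID
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                # Marks the end of the cacheable prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic content goes after the breakpoint.
        "messages": [{"role": "user", "content": user_message}],
    }


a = build_request("Where is my refund?")
b = build_request("How do I change my email?")
print(a["system"] == b["system"])  # True: the cacheable prefix is identical
```

Only the messages differ between the two requests, so everything up to the breakpoint is eligible for reuse.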

Is LLM caching worth it for a low-traffic app?

Not always. Caching carries a write premium on the first request, and provider caches expire in minutes. If requests sharing a prefix arrive rarely (or in bursts with long gaps), the cache expires between hits and you keep paying to re-warm it. Caching shines when a large, stable prefix is reused frequently within the TTL window. For semantic caching, you also take on vector-store infrastructure, which needs enough query volume to pay for itself.
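
A back-of-the-envelope way to check this is to compare the cost of the shared prefix with and without caching over one cache lifetime. The sketch below assumes a 1.25x write multiplier and a 0.10x read multiplier relative to the base input price. Both are illustrative defaults and the caching_cost_ratio function is hypothetical.

```python
def caching_cost_ratio(
    hits_per_write: float,
    write_multiplier: float = 1.25,  # assumed write premium
    read_multiplier: float = 0.10,   # assumed cached-read discount
) -> float:
    """Prefix cost with caching divided by prefix cost without it."""
    requests = 1 + hits_per_write  # one cache write plus the hits that follow
    cached = write_multiplier + read_multiplier * hits_per_write
    return cached / requests


for hits in (0, 1, 5, 20):
    print(hits, round(caching_cost_ratio(hits), 3))
```

A ratio above 1.0 means caching costs more than it saves. Under these assumptions that only happens when the cache expires before a single reuse, which is the low-traffic case described above.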

What Is Prompt Engineering? A Practical Guide to Context Engineering and KV Cache

Rishabh Poddar — Fri, 03 Jul 2026 04:46:54 +0000

Prompt engineering started as a narrow craft and then grew into a much bigger idea.

At first, it just meant learning how to write better instructions so a model would give better answers. That is still part of the job. But once people started building real AI agents, the question got bigger. It was no longer just about the wording of the prompt. It became about what context the model sees, what tools it can use, what history it should remember, and how to keep all of that stable enough to run fast and cheaply. That is where context engineering comes in.

The short version

To put it directly, prompt engineering focuses on writing the instruction itself, while context engineering shapes the entire environment around that instruction. A well-written prompt gets you a better initial response, but context engineering is what makes the whole system work reliably. If you are building production workflows, this distinction directly affects quality, cost, latency, and safety.

What prompt engineering actually is

Prompt engineering is the practice of writing instructions that help an AI model do a task well. This process involves giving the model a role, defining the task clearly, adding examples, setting constraints, specifying the output format, and telling the model what to avoid.

For example, a basic prompt might ask a model to summarize a support ticket in three bullets and end with a recommended next step. A stronger prompt would add tone, audience, formatting rules, and edge cases.

Even with advanced systems, clear instructions still matter because a vague prompt wastes time no matter how good the model is. Clear instructions improve consistency and reduce guesswork. But the moment your system starts using tools, memory, retrieval, approvals, or multi-step workflows, prompt engineering alone stops being enough.

What context engineering means

Context engineering is the broader discipline of deciding what information the model should see when it runs, which includes the system prompt, user instructions, conversation history, retrieved documents, tool outputs, memory, policies, guardrails, and any hidden internal state.

Anthropic describes context engineering as the natural progression of prompt engineering, and that framing is helpful. While prompt engineering focuses on how to phrase an instruction, context engineering curates the entire set of tokens the model receives, shifting the question from how to ask something to what the model needs to know at any given moment.

Prompt engineering vs context engineering

Think of prompt engineering as the content inside the context window, whereas context engineering is the process that decides what fills that window in the first place.

Prompt engineering is still useful for single-turn tasks, classification, drafting, and clean one-off generation. Context engineering is what you need when the model has to operate as part of a system. If the model is reading docs, calling tools, remembering prior steps, or responding to dynamic data, you are doing context engineering whether you call it that or not.

This explains why the conversation has shifted. In early LLM apps, the bottleneck was often prompt wording. In modern agent systems, the bottleneck is usually context quality.

Why this matters for AI agents

Agents create context pressure very quickly. Every tool call adds output. Every retrieved document adds text. Every decision adds history. Every retry adds more tokens. Before long, the model is carrying around a lot more than the original prompt. Systems often become brittle when developers keep stuffing everything into the prompt and hoping the model will sort it out, which usually creates noise instead of clarity.

Good context engineering keeps the model's working set small, relevant, and stable.

To address this, platforms like teamcopilot.ai rely on reusable workflows and skills. If the agent needs the same procedural knowledge every time, that knowledge should not be rebuilt from scratch in every session. It should be reusable, predictable, and loaded only when needed.

KV cache changes the economics

While prompt engineering focuses on wording, KV cache optimization is about structure. Modern model providers cache the internal attention state for stable prefixes, which means if the beginning of your prompt stays the same across requests, the provider can often reuse work instead of recomputing it from scratch. That matters because it reduces both latency and cost.

The key idea is simple: keep the stable part of the prompt stable. If you keep changing the system prompt, adding timestamps near the top, or inserting request-specific data into the prefix, you break cache reuse, and the provider has to recompute more of the prompt from scratch.

The practical rule: put dynamic content at the end

One of the most useful production habits is to place dynamic content at the end of the prompt whenever possible. The reason is straightforward: the more stable the leading prefix is, the easier it is for the model provider to reuse cached computation. If the early part of the prompt changes all the time, the whole prefix becomes harder to reuse.

So the general pattern should look like this:

Start with a stable system prompt.
Place shared instructions right after it.
Keep reusable policy and format guidance near the top.
Insert request-specific data later in the sequence.
Keep the most dynamic bits at the very end.

Whole prompts don't need to be identical, but the shared prefix does, so prioritize prefix stability over clever wording. For agentic workloads, this can make a real difference. The research and practitioner guidance around prompt caching keeps landing on the same point: stable prefixes are cheaper and faster, while dynamic data belongs later in the prompt or outside it entirely.

A better prompt structure

If you are building something real, a good prompt often has four layers:

1. Core instruction

This stable foundation tells the model what role it is playing and what overall standard it should follow.

2. Policy and format

Also kept stable, this layer includes guardrails, style, output shape, and any constraints that should remain the same across runs.

3. Reusable context

Sitting in the middle, this layer might include retrieved knowledge, a task brief, or structured memory that is relevant for a whole class of requests.

4. Dynamic request data

Placed at the very end whenever possible, this section contains the user's current request, a timestamp, a customer ID, a ticket number, or the latest tool output.

This layering helps both quality and caching. It keeps the stable prefix stable, which gives the infrastructure a chance to do its job.
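
The four layers above can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed implementation: the instruction text, the ticket fields, and the stable_prefix and build_prompt helpers are all hypothetical.

```python
import json
from datetime import datetime, timezone

CORE_INSTRUCTION = "You are a support assistant."          # 1. core instruction
POLICY_AND_FORMAT = "Reply in three bullets. Be concise."  # 2. policy and format


def stable_prefix(reusable_context: str) -> str:
    """Layers 1 to 3: identical for every request in the same class."""
    return "\n\n".join([CORE_INSTRUCTION, POLICY_AND_FORMAT, reusable_context])


def build_prompt(reusable_context: str, request: dict) -> str:
    """Append layer 4 (dynamic request data) after the stable prefix."""
    dynamic = json.dumps(request, sort_keys=True)  # consistent field order
    return f"{stable_prefix(reusable_context)}\n\n{dynamic}"


# 3. reusable context shared by a whole class of requests
brief = "Task brief: summarize the ticket and recommend a next step."

first = build_prompt(brief, {
    "ticket": "T-1001",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "text": "My invoice is wrong.",
})
second = build_prompt(brief, {
    "ticket": "T-1002",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "text": "I cannot log in.",
})

shared = stable_prefix(brief)
print(first.startswith(shared) and second.startswith(shared))  # True
```

The timestamp and ticket number change on every call, but because they sit in the final layer, both prompts still begin with the same bytes.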

What not to do

A few common mistakes create needless pain. To keep your system running smoothly:

Avoid putting user-specific values into the top of the system prompt unless absolutely necessary.
Keep JSON fields in a consistent order if the model sees that JSON as part of the prompt.
Never bury dynamic values in the middle of a long, stable instruction block.
Stop appending giant tool transcripts when only the latest result actually matters.
Do not treat every request like a brand-new prompt if most of the instruction never changes.

These mistakes are small in isolation, but they quickly add up at scale.

Context engineering in practice

In a real agent, context engineering is usually about making tradeoffs. Instead of feeding the model everything, you want to select only the most relevant information, which means deciding which documents to retrieve, which memory to keep, which tool output to include, how much history to preserve, whether to summarize older turns, and what to keep out of the prompt entirely.
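
One of these tradeoffs, how much history to keep, can be sketched as a budgeted selection step. The example below is a simplified illustration under stated assumptions: the Turn class, the word-count token estimate, and the placeholder note for omitted turns are all hypothetical, and a real system would use a tokenizer and a proper summary.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user", "assistant", "tool", or "system"
    text: str


def estimate_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: about one token per word."""
    return len(text.split())


def select_context(history: list[Turn], budget: int) -> list[Turn]:
    """Keep the newest turns that fit the budget; note what was left out."""
    kept: list[Turn] = []
    used = 0
    seen_tool_output = False
    for turn in reversed(history):  # walk from newest to oldest
        if turn.role == "tool":
            if seen_tool_output:
                continue  # only the latest tool result is kept
            seen_tool_output = True
        cost = estimate_tokens(turn.text)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()
    dropped = len(history) - len(kept)
    if dropped:
        # Placeholder for a real summary of the omitted turns.
        kept.insert(0, Turn("system", f"[{dropped} earlier turns omitted]"))
    return kept


history = [
    Turn("user", "My order 1234 has not arrived"),
    Turn("tool", "lookup result: order 1234 status pending carrier pickup"),
    Turn("assistant", "It is waiting for carrier pickup"),
    Turn("user", "Can you check again"),
    Turn("tool", "lookup result: order 1234 status shipped"),
    Turn("user", "Great, when will it arrive"),
]

selected = select_context(history, budget=20)
print([turn.role for turn in selected])  # ['system', 'user', 'tool', 'user']
```

The model sees the latest tool result and the most recent exchange, plus a short marker that older turns exist, instead of the full transcript.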

This is the heart of the difference between a toy demo and a production system. While prompt engineering makes a model sound smarter, context engineering is what makes it truly useful.

How teamcopilot.ai fits in

This is exactly the sort of problem teamcopilot.ai is built to help with. When teams are running reusable workflows, they usually want the instructions to be consistent, the risky steps to be gated, and the context to stay manageable. That is much easier when the system is designed around reusable skills, permissions, and controlled execution instead of one-off prompts scattered everywhere.

If you want a related read, MCP vs Skills: Why Skills Save Context Tokens is a good companion post. It covers the same basic idea from another angle: less unnecessary context, more deliberate reuse.

You may also like Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails and Why Your AI Agent Should Never See Your API Keys. Those posts are really about the same production problem: keep the agent useful, but keep the dangerous parts under control.

A simple production checklist

If you are building an AI workflow today, use this as a quick check:

Keep your system prompt stable.
Put dynamic data at the end of the prompt.
Strip out anything the model does not need.
Summarize old context instead of endlessly appending it.
Retrieve only the most relevant information.
Separate instructions from data.
Use approvals for high-risk actions.
Reuse skills or templates when the same procedure repeats.

That checklist is not fancy, but it works.

The deeper lesson

The real shift is that prompt engineering has evolved from phrasing into systems design.

Building agents that read, retrieve, act, and iterate means designing an information flow rather than just writing prompts. Deciding what enters the model, when, and how consistently is what defines context engineering. And when you care about latency and cost, it also becomes cache engineering.

FAQ

What is prompt engineering?

Prompt engineering is the practice of writing instructions that help an AI model do a task well. It covers role setting, examples, constraints, and output formatting.

What is context engineering?

Context engineering is the broader job of deciding what information the model should see, including prompts, memory, retrieved data, tool output, and policy.

Is context engineering replacing prompt engineering?

No. Prompt engineering is still part of the job. Context engineering just expands the scope to include the whole environment around the prompt.

Which matters more for AI agents?

Context engineering usually matters more for agents, because agents depend on retrieval, tool use, state, and repeated steps. A good prompt helps, but good context keeps the system usable.

Why does prompt structure affect KV cache?

Because caching works best when the prefix stays stable. If you keep changing the beginning of the prompt, the provider has less reusable computation.

Should dynamic data always go at the end of the prompt?

Usually, yes. If the data is request-specific, placing it near the end helps preserve a stable prefix. There are exceptions, but this is a strong default.

What counts as dynamic data?

This includes timestamps, request IDs, user-specific details, session state, latest tool output, and other values that change often.

What should stay stable in a prompt?

Your role instruction, policies, formatting rules, and reusable task guidance should stay as stable as possible.

Does prompt caching always help?

No. It is most effective when a large chunk of the prompt repeats across requests. If every prompt is unique and short, the benefit is much smaller.

What is the biggest mistake teams make?

They overstuff the prompt with everything they know. That hurts clarity, weakens caching, and makes the model carry more context than it needs.

How does this apply to AI workflows?

Use a stable instruction layer, keep dynamic inputs separate, and make the workflow reusable. That is a much better foundation than one giant prompt copied around by hand.

How does teamcopilot.ai help here?

teamcopilot.ai helps teams build reusable AI workflows with shared instructions, permissions, and controlled execution. That makes it easier to keep context consistent without turning every prompt into a one-off experiment.

What is the difference between prompt engineering and prompt caching?

Prompt engineering is about making the instruction effective. Prompt caching is about making repeated prefixes cheaper and faster to process.

Is context engineering just retrieval augmented generation?

No, retrieval is only one part of it; context engineering also includes memory, tool outputs, policies, history management, and what you intentionally leave out.

What should beginners focus on first?

Start with clear prompts, then learn to separate stable instructions from dynamic inputs before moving on to retrieval, memory, and caching.

What is the simplest rule to remember?

Keep the stable stuff stable, and put the changing stuff at the end.

Agentic AI vs. Generative AI: What's the Difference?

Rishabh Poddar — Wed, 01 Jul 2026 11:18:43 +0000

People often use agentic AI and generative AI as if they mean the same thing. They do not.

Generative AI is there to create output. Agentic AI is there to finish a goal. One writes, summarizes, drafts, and transforms. The other plans, calls tools, checks results, and keeps going until the job is done or the workflow should stop.

That difference looks small on paper, but it changes almost everything about the system. It changes how much context the model needs, how state is handled, how failures are recovered, and how much trust you can place in the system without a human review step.

If you want a broader look at how these systems behave once they are running in loops, our post on What Is an Agent Loop? How AI Agents Reason, Act, and Iterate is a good companion read. This article focuses on the difference between the model that generates and the system that acts.

The short version

At its simplest, generative AI produces new content in response to a prompt, acting mostly reactively. Agentic AI, on the other hand, is proactive. It coordinates models, tools, memory, and policies to reach a specific goal. In practice, agentic systems often run generative models inside them to draft emails, summarize documents, or classify results, but the agent itself decides what happens next.

What generative AI is good at

Generative AI is the part most people saw first. You type a prompt, and the system returns text, code, an image, or some other generated output.

The core strength of generative AI is content creation. It can:

draft documents
summarize long text
write code snippets
generate images or audio
rewrite or translate content
answer questions from context

Generative AI works well when the task is bounded and the desired output can be produced in one response. It is especially useful when a human still owns the final decision.

That makes it a strong fit for drafting, brainstorming, and analysis. It is also why many teams begin with generative AI before they move into full workflow automation.

What agentic AI is good at

Unlike generative AI, which focuses on producing answers, agentic AI aims to achieve specific outcomes.

An agentic system usually has some combination of:

a planning step
a memory or state layer
tools or APIs it can call
a loop that checks progress
rules that control when it should stop or ask for help

That means agentic AI can do things like:

research a topic across multiple sources
open a ticket, update the status, and notify the right people
inspect a repo, make a change, run a check, and retry if needed
monitor a system and escalate only when a threshold is crossed
guide a customer through a process across several steps

Its real value lies in completing tasks rather than just generating text.

For a team-focused view of why that matters in production, see AI Agent Governance: Why Identity Security Is the New Budget Line. Once an agent can act, governance stops being optional.

The technical difference

To put it in technical terms, generative AI maps inputs to outputs, while agentic AI maps a high-level goal to a sequence of actions.

Generative AI usually runs as a single call that takes context, produces an output, and stops.

Agentic AI behaves more like a control system, running a continuous loop:

Receive a goal.
Gather context.
Decide the next best action.
Call a tool or model.
Observe the result.
Update state.
Repeat until done.
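
The loop above can be sketched as a few lines of control flow. This is a toy illustration, assuming a rule-based decide_next_action function in place of a real model and two stub tools. All names here are hypothetical.

```python
from typing import Callable, Optional

State = dict[str, object]


def fetch_ticket(state: State) -> str:
    return "Customer asks for order status"


def draft_reply(state: State) -> str:
    return f"Reply drafted for: {state['fetch_ticket']}"


TOOLS: dict[str, Callable[[State], str]] = {
    "fetch_ticket": fetch_ticket,
    "draft_reply": draft_reply,
}


def decide_next_action(state: State) -> Optional[str]:
    """Stand-in for the model's planning step: pick the first tool not yet run."""
    for name in TOOLS:
        if name not in state:
            return name
    return None  # nothing left to do


def run_agent(goal: str, max_steps: int = 5) -> State:
    state: State = {"goal": goal}           # receive a goal, gather context
    for _ in range(max_steps):              # hard stop to avoid endless loops
        action = decide_next_action(state)  # decide the next best action
        if action is None:
            break                           # done
        result = TOOLS[action](state)       # call a tool or model
        state[action] = result              # observe the result, update state
    return state


final_state = run_agent("Resolve the support ticket")
print(final_state["draft_reply"])  # Reply drafted for: Customer asks for order status
```

The max_steps limit is a deliberate design choice: even a simple loop needs a stopping condition that does not depend on the planner getting things right.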

That is why people talk about orchestration when they describe agentic systems. The system coordinates work instead of merely generating content.

If you want to see how the orchestration layer changes the user experience, MCP vs Skills: Why Skills Save Context Tokens is useful background. It shows how much of the system is about control surface, not just raw model output.

How they work together

The best systems combine both approaches. Generative AI is often the reasoning and language layer inside an agent, while agentic AI serves as the workflow layer around it.

For example, when a customer request comes in, the agent might first route it to support. Next, a generative model drafts the reply. Before anything is sent, the agent evaluates policy compliance and confidence levels, routing sensitive drafts to a human reviewer. Finally, the workflow logs the outcome and updates the ticket status.

That pattern is common because generative AI is good at local tasks, while agentic AI is better at managing the bigger process.

So the better question is not, ‘Which one is better?’ It is, ‘Which part of the job needs creation, and which part needs execution?’

Why the distinction matters in production

The difference matters as soon as the system touches real tools.

A generative model that writes a summary can be useful and relatively low risk. An agent that can update systems, send messages, or change permissions is operating in a completely different risk category.

That changes the design requirements:

You need access controls.
You need audit logs.
You need approval gates for sensitive actions.
You need clear stopping conditions.
You need a recovery path when the agent makes a bad choice.
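
The first three requirements above (access controls, audit logs, approval gates) can be sketched as a thin wrapper around tool execution. This is a minimal illustration with hypothetical names throughout, including the action names and the GuardedExecutor class. Stopping conditions and recovery paths are out of scope for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable

ALLOWED_ACTIONS = {"read_ticket", "send_email"}  # access control
NEEDS_APPROVAL = {"send_email"}                  # sensitive actions


@dataclass
class GuardedExecutor:
    approve: Callable[[str, dict], bool]  # human approval callback
    audit_log: list[dict] = field(default_factory=list)

    def run(self, action: str, args: dict, handler: Callable[[dict], str]) -> str:
        if action not in ALLOWED_ACTIONS:
            outcome = "denied"
        elif action in NEEDS_APPROVAL and not self.approve(action, args):
            outcome = "rejected"
        else:
            outcome = handler(args)
        # Every attempt is logged, including the ones that were blocked.
        self.audit_log.append({"action": action, "args": args, "outcome": outcome})
        return outcome


# A reviewer who rejects everything, to show the gate in action.
executor = GuardedExecutor(approve=lambda action, args: False)

print(executor.run("read_ticket", {"id": 7}, lambda args: "ticket loaded"))
print(executor.run("send_email", {"to": "customer"}, lambda args: "email sent"))
print(executor.run("delete_records", {"table": "users"}, lambda args: "deleted"))
print(len(executor.audit_log))  # 3
```

The handler for a blocked action is never called, but the attempt still lands in the audit log, which is what makes later review possible.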

Many teams start with a helpful assistant and slowly grant it more power without updating the surrounding control model. The result is automation that is less controlled, not smarter.

The governance gap

While generative AI risk usually centers on output quality, like hallucinations or misleading text, agentic AI introduces operational risks.

When an agent operates in live systems, a bad decision can trigger immediate real-world consequences, such as sending an incorrect email, deleting files, changing permissions, or corrupting customer records.

teamcopilot.ai is built to let agents work safely within defined permissions, approvals, and audit trails, making the workflow useful without being reckless.

If you want the security side in more detail, read Why Your AI Agent Should Never See Your API Keys.

A practical comparison

Here is a quick side-by-side comparison. Generative AI focuses on creating content, whereas agentic AI is built to handle end-to-end workflows.

Under a generative model, the system reacts to prompts to produce text or code, usually finishing the task in a single pass. The primary risk here is output quality.

An agentic system starts with a high-level goal, planning and executing multiple steps over time. Because it interacts with real systems, its risks are operational.

Common examples of each

What you can build with generative AI

Drafting a blog post
Summarizing meeting notes
Writing a code snippet
Translating a document
Generating product copy

What you can automate with agentic AI

Investigating and routing support tickets
Updating a CRM after a sales call
Monitoring logs and escalating incidents
Researching a topic and producing a decision memo
Running a multi-step code review workflow

Notice the pattern. Generative AI creates artifacts. Agentic AI completes processes.

Where most teams should start

Most teams should start with generative AI, then layer agentic behavior on top once the process is stable.

Start with a narrow, low-risk workflow to prove that the output is reliable. From there, you can add workflow steps around it, introduce approvals for sensitive actions, and expand only after the system proves its reliability. This path gives you useful automation without handing broad access to an ungoverned system.

It also creates room for reuse. Once a workflow is safe and documented, the team can share it instead of rebuilding it in every chat.

Why this is becoming the default enterprise pattern

What most teams end up with is a mix of both: the model drafts the content, the agent coordinates the next steps, and the platform keeps the entire process within safe boundaries.

This division matters because enterprise teams need predictable behavior. They need clear rules to define which tasks can run autonomously, which require human approval, and which must stop if confidence drops.

That is the kind of control layer teamcopilot.ai is designed for. It lets teams build reusable workflows once, then run them with the right permissions instead of inventing a new prompt every time.

How to choose between them

When deciding which approach to use, choose generative AI for tasks like content creation, summarization, drafting, and analysis. Opt for agentic AI when you need multi-step execution, tool integration, continuous monitoring, or conditional branching.

Many real-world systems combine both, using generative models to draft content and agentic workflows to execute the subsequent decisions and actions.

The big misunderstanding

The common mistake is to think agentic AI is just a fancier prompt.

While a prompt might start an agent, the real value comes from the surrounding structure of memory, tools, and policies; without these guardrails, you just have a chat response that happens to mention a next step.

That is why the question is less ‘Can the model write?’ and more ‘Can the system safely keep working?’

FAQ

What is the main difference between agentic AI and generative AI?

Generative AI creates content in response to a prompt. Agentic AI uses models plus tools, memory, and control logic to complete a goal through multiple steps.

Is agentic AI just generative AI with tools?

Not exactly. Tools help, but agentic AI also needs planning, state, feedback, and a policy layer that decides what it can do and when it should stop.

Can a generative AI model be part of an agentic system?

Yes. In most real systems, the generative model is the reasoning or content layer inside a larger agentic workflow.

Which one is more useful for businesses?

They solve different problems. Generative AI is useful for drafting, summarizing, and analysis. Agentic AI is useful when the business wants a system to carry work forward, not just produce text.

Is agentic AI riskier?

Usually yes, because it can act in live systems. That creates operational risk on top of the normal risk of model errors.

Do agentic AI systems always need human approval?

No, but high-risk actions should. Low-risk tasks can often run automatically, while anything irreversible or sensitive should have a human checkpoint.

What kind of tasks should stay in generative AI?

Tasks where the output is the main value and a person will still make the final decision, such as drafts, summaries, translations, and brainstorming.

What kind of tasks belong in agentic AI?

Tasks with a clear goal, multiple steps, and tool use across systems, such as ticket routing, incident triage, research workflows, and operational follow-up.

Why does governance matter so much for agentic AI?

Because an agent can do something wrong, not just say something wrong. Once a system can act, permissions, logs, approvals, and revocation become part of the product.

What should a team build first?

Start with a narrow, low-risk workflow to prove that the output is reliable, then add approvals and more autonomy only when the control layer is ready.

How does teamcopilot.ai fit into this?

teamcopilot.ai helps teams run reusable AI workflows with permissions, approvals, secret handling, and audit trails, which is exactly what agentic systems need once they move beyond simple content generation.

What is the safest mental model for these two terms?

Think of generative AI as the writer and agentic AI as the worker. The writer produces, while the worker completes the process.

Can I use both in the same workflow?

Yes. That is often the best design. Use generative AI for the language and reasoning steps, then use agentic orchestration to move the work through the system safely.

Cloud AI Agents vs Local AI Agents: Which Is Better for Privacy, Cost, and Latency?

Rishabh Poddar — Mon, 29 Jun 2026 04:31:26 +0000

Cloud AI agents and local AI agents are solving the same problem from opposite ends.

Both can write code, run tools, and take actions. The real difference is where they live, what they can see, and how much of the setup you want to own.

That stops sounding abstract the moment an agent starts touching real files, opening a browser, running commands, or moving data between systems. At that point, the deployment model matters just as much as the model.

What counts as cloud vs local

A cloud agent runs in vendor-managed infrastructure. You hand it a task, it works in a remote sandbox, and it usually comes back with a pull request, a report, or a finished workflow.

That is the shape you see with tools like Devin and OpenAI Codex in cloud mode. It is also the shape you get when you run a team agent on your own hosted infrastructure, such as teamcopilot.ai on a VPS or private server.

A local agent runs on your machine or inside an environment you directly control. Claude Code is the clearest example. It works inside your terminal, sees your actual workspace, and can use the tools and files already on your system.

There is also a middle ground: local-style agents running on a cloud box you own. That is where a VPS comes in.

Why cloud agents are attractive

Cloud agents are best when you want to hand off work and come back later.

They usually need less setup. You do not have to prepare your laptop, keep a terminal open, or worry about whether your machine will sleep halfway through a task. They are also easier to isolate. A fresh sandbox is cleaner than a dev machine that has been used for 40 different side projects.

That matters for teams. A cloud agent is easier to share, easier to observe, and easier to review. The output often lands as a branch, a PR, or a completed task that someone else can inspect.

The downside is equally simple. Cloud agents do not naturally know your local environment. They may not see your private tools, custom scripts, hidden config, or the exact state of your laptop. You are also trusting someone else’s infrastructure to run the work.

That tradeoff is why cloud agents are a strong fit for:

long-running jobs
parallel work across many tasks
team workflows with review steps
code changes that should land as a PR
tasks where reproducibility matters more than local convenience

If you want a broader view of how governance changes once agents can act, read AI Agent Governance Is the New Enterprise Control Plane.

Why local agents still matter

Local agents are better when the real value is in your environment.

Claude Code is a good example because it sits close to the work. It can read your actual repository, see the files you have open, and work against the same state you are already using. That makes it very good for iterative coding, debugging, and tasks that depend on your local setup.

Local agents often feel faster because the feedback loop is tighter. You can interrupt them, redirect them, or correct them without waiting on a remote job to finish. If your work depends on machine-specific tools, internal scripts, or private credentials that live only on your device or network, local access is often the cleanest path.

The downside is that local agents are harder to scale. They depend on your machine being available. They also depend on your setup being clean enough for the agent to use. And if you want to run several jobs at once, you quickly start managing the same infrastructure problems that cloud agents hide from you.

Local agents are usually the right fit for:

active coding sessions
tight back-and-forth debugging
private environments and bespoke tooling
individual developers who want direct control
workflows that depend on local filesystem state

For a team view of how local coding agents fit into shared workflows, see How to Use Claude Code with a Team: Shared Context, Permissions, and MCP.

Cloud vs local vs VPS

If you want the blunt version, cloud is about convenience and local is about control.

Setup	Best for	Strengths	Weaknesses
Cloud agent	Delegated tasks and team review	Easy to start, clean sandbox, good for async work	Less access to local state, more trust in vendor infra
Local agent	Interactive work on your own machine	Direct access to your repo, tools, and config	Harder to scale, depends on your machine staying alive
VPS-hosted agent	A middle ground	Persistent, remote, controlled by you	You now own the uptime, security, and maintenance

A VPS is the interesting case because it gives you a lot of the benefits people want from a cloud agent, without giving up ownership of the environment.

You can run Claude Code on a VPS, keep the session alive with tmux, and treat the box like a dedicated AI workstation. The same is true for other local-first tools if you want a remote machine that behaves like your own always-on dev box. That setup works well when you want a persistent workspace, remote access from anywhere, a fixed environment for repeatable work, more privacy than a vendor sandbox, and the convenience of cloud hosting without losing control.

The tradeoff is that you have to maintain it. Patching, secrets, access controls, and uptime become your problem.

Where teamcopilot.ai fits

teamcopilot.ai fits where a team needs the agent to behave like shared infrastructure rather than an individual assistant.

If you need shared workflows, reusable skills, approvals, secret handling, and one place the whole team can work from, the hosting model matters less than the control layer around it. That is why teamcopilot.ai can fit nicely on a VPS or private cloud even when the agent itself is doing cloud-like work. In other words, the box matters. But the rules around the box matter more.

If the agent is going to touch production systems, the focus should be on what it can do, who approved it, and what gets logged, rather than where it runs. For that side of the story, Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails is the natural companion post.

And if secrets are involved, the boundary has to be tight. Start with Why Your AI Agent Should Never See Your API Keys.

Practical guidance

If you are choosing today, here is the plain answer:

Cloud agents work best for handing off tasks to review later.
If you need tight control, fast iteration, and direct access to your current environment, go with a local agent.
A VPS offers a solid middle ground, giving you remote access to a machine you fully own.

For most teams, the answer is hybrid. Use local agents for interactive work. Use cloud agents for long-running or parallel tasks. Use a VPS when you want a stable, owned environment in between.

That is where the market is going. Not because one approach won outright, but because different jobs need different levels of control.

FAQ

What is a cloud AI agent?

A cloud AI agent runs in remote infrastructure instead of on your local machine. It usually works in a sandbox and returns something you can review later.

What is a local AI agent?

A local AI agent runs on your machine or another environment you directly control. It can work with your files, tools, and repo state more directly.

Is Claude Code a local agent?

Yes. Claude Code is best thought of as a local-first agent. It runs close to your workspace and is strongest when it can see your actual development environment.

Is Devin a cloud agent?

Yes. Devin is the clearest example of a cloud-first autonomous agent. It is built to work in a remote sandbox and hand back finished work.

Is OpenAI Codex cloud or local?

It can be used in both patterns, but the cloud-agent workflow is the more obvious one. That is usually what people mean when they talk about Codex as a delegated coding agent.

Can you run a local agent on a VPS?

Yes. That is one of the best middle-ground setups. You get a persistent remote machine that behaves like your own controlled environment.

Is a VPS the same thing as a cloud agent?

Not exactly. A VPS is infrastructure you own or rent. A cloud agent is usually a product running on vendor infrastructure. A VPS can host a local-style agent, which gives you more control than a typical vendor sandbox.

Are cloud agents better for teams?

Often, yes. Cloud agents are easier to share, observe, and review. They are especially good when work should end in a PR or another reviewable artifact.

Are local agents safer?

They can be, because they stay closer to your own environment and tooling. But safety depends on permissions, secrets, approvals, and review, not just where the agent runs.

What is the biggest downside of cloud agents?

They do not naturally see your local setup, and you are trusting vendor infrastructure with your work.

What is the biggest downside of local agents?

They are tied to your machine or your own infrastructure, so setup and uptime become your responsibility.

When should I choose a VPS instead?

Choose a VPS when you want a persistent environment, direct control, and remote access without giving up ownership of the machine.

Where does teamcopilot.ai fit in this picture?

teamcopilot.ai provides a shared workspace for teams, offering centralized permissions, approvals, reusable workflows, and self-hosted deployment options.

What should I read next?

Start with AI Agent Governance Is the New Enterprise Control Plane, then read How to Use Claude Code with a Team: Shared Context, Permissions, and MCP. Those two posts cover the control layer and the team workflow side of the same problem.

Human-in-the-Loop AI Agents: Approvals, Permissions, and Audit Trails

Rishabh Poddar — Fri, 26 Jun 2026 05:02:16 +0000

Human-in-the-loop AI is a practical operating model for production systems. In this model, the AI prepares work or suggests actions, while a person checks the important steps before anything risky happens. That review is what turns AI from a confident assistant into a system you can trust in production.

That matters more as agents become more capable. A chatbot that answers a question is one thing. An agent that can send messages, touch files, change records, or trigger workflows is something else entirely. Once the system can act, three questions matter: who approved it, what could it do, and what happened after it ran? Answering these questions requires three distinct controls: approvals, permissions, and audit trails.

What human-in-the-loop AI actually means

Human-in-the-loop AI means the model does not get the final say on its own when the action matters. It can draft, rank, recommend, or prepare an action, but a person still reviews the result before execution.

In practice, that could look like this:

an agent drafts a customer reply, then a human approves it before it is sent
an IT agent prepares a config change, then waits for sign-off before applying it
a finance workflow gathers the data, then a reviewer confirms the payment or transfer

Instead of making every action manual, the goal is to keep judgment where it belongs.

Why approvals matter

Approvals are the obvious part of the system, but they are also the part teams get wrong first.

Without a real approval step, agentic workflows drift into “do it now, explain later.” That is fine for low-risk drafts. It is a bad idea for anything that touches customers, credentials, production systems, or money.

Approvals create a pause at the moment when the system is about to cross from intent into execution. That pause does a few useful things at once:

it keeps a human accountable for the decision
it reduces the chance of silent mistakes
it gives the reviewer one clear place to intervene
it makes the workflow easier to explain to security, legal, and operations teams

A good approval prompt should be specific. A vague “approve this?” is not enough. The reviewer should see what the agent wants to do, why it wants to do it, and what the impact will be if it goes wrong.
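As a rough illustration of what a specific approval prompt might carry, here is a minimal Python sketch. The `ApprovalRequest` class, its field names, and the sample values are assumptions made for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    """Hypothetical shape for what a reviewer sees before approving."""
    action: str   # what the agent wants to do
    reason: str   # why it wants to do it
    impact: str   # what happens if it goes wrong

    def render(self) -> str:
        # Refuse to render a vague prompt: every field must be filled in.
        for name in ("action", "reason", "impact"):
            if not getattr(self, name).strip():
                raise ValueError(f"approval request is missing '{name}'")
        return (
            f"Action: {self.action}\n"
            f"Reason: {self.reason}\n"
            f"Impact if wrong: {self.impact}"
        )


request = ApprovalRequest(
    action="Send refund confirmation email to customer #1042",
    reason="Support ticket marked refund as approved",
    impact="Customer receives an incorrect refund notice",
)
prompt = request.render()
```

Rejecting empty fields at render time is one way to keep a bare "approve this?" from ever reaching a reviewer.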

Why permissions matter even more

Approvals without permissions are only half a control system.

An agent still needs to know what it is allowed to touch before it gets to the approval step. If everything is broadly available, then the approval process becomes a thin layer on top of an overpowered system.

Good permissions keep the agent small by default. A research agent should not have the same reach as an ops agent. A workflow that drafts a message should not be able to delete records. A tool that reads data should not automatically inherit write access.

That is the same direction we discussed in AI Agent Governance Is the New Enterprise Control Plane and AI Agent Governance: Why Identity Security Is the New Budget Line. Once agents become real actors in your stack, identity and access stop being background details.

The simplest rule is still the best one: give the agent only the access it needs for the job it is actually doing.
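One way to picture "small by default" is a deny-by-default scope check. The sketch below is only illustrative: the agent names, the scope strings, and the `is_allowed` helper are all hypothetical.

```python
# Hypothetical scopes per agent role; names are illustrative only.
AGENT_SCOPES = {
    "research-agent": {"docs:read", "tickets:read"},
    "ops-agent": {"tickets:read", "tickets:write", "deploys:write"},
}


def is_allowed(agent: str, scope: str) -> bool:
    """Deny by default: unknown agents and unlisted scopes get no access."""
    return scope in AGENT_SCOPES.get(agent, set())


# Read access does not imply write access.
can_read = is_allowed("research-agent", "tickets:read")
can_write = is_allowed("research-agent", "tickets:write")
```

Because every scope is listed explicitly, a tool that reads data never picks up write access as a side effect.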

What an audit trail should record

While approvals pause actions and permissions set boundaries, the audit trail provides the permanent record. You need to be able to answer what the agent tried to do, who approved it, what context the reviewer saw, and whether the action really happened. If you cannot reconstruct that later, you lack true governance and rely only on trust.

A useful audit trail usually captures:

the agent identity
the human reviewer or approver
the requested action
the policy or workflow that allowed it
the time of review and execution
the result of the action
any error, override, or escalation
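The fields above could be captured in a record like the following. This is a minimal sketch: the `AuditRecord` class, its field names, and the JSON-lines format are assumptions chosen for the example, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AuditRecord:
    agent_id: str
    approver: str
    requested_action: str
    policy: str
    reviewed_at: str
    executed_at: Optional[str]  # None if the action never ran
    result: str
    note: Optional[str] = None  # any error, override, or escalation


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


now = datetime.now(timezone.utc).isoformat()
record = AuditRecord(
    agent_id="ops-agent",
    approver="alice@example.com",
    requested_action="apply config change to staging",
    policy="config-change-v1",
    reviewed_at=now,
    executed_at=now,
    result="success",
)
line = to_log_line(record)
```

Keeping review time and execution time as separate fields makes it possible to tell later whether an approved action actually ran.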

This is especially important when the workflow touches secrets or sensitive data. If the agent is allowed to see too much, the audit trail becomes the only way to understand how a problem happened.

That is one reason Why Your AI Agent Should Never See Your API Keys matters so much. If a model can see raw credentials, the blast radius gets much bigger than most teams expect.

The common failure mode

Most bad HITL systems fail in the same way: they keep the human in the loop in name only.

The reviewer gets too much noise. The approval prompt is vague. The agent has too much access. The logs are hard to search. Nobody knows which decisions need escalation and which do not. Over time, the team starts clicking approve because it is easier than reading the context.

That is automation bias in a nutshell.

The fix is not to remove the human. It is to make their job smaller and clearer.

A better pattern for teams

If you are designing a workflow from scratch, start with a simple separation:

the agent proposes
the system checks policy
a human approves the risky step
the system executes with limited scope
the action gets logged

That separation makes the logic easier to reason about. It also keeps you from copying brittle approval logic into every new automation.
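The five steps above can be sketched as one small function. Everything here is hypothetical: the `RISKY_ACTIONS` set stands in for a real policy, and `approve` and `execute` are placeholder callables for a human review step and a scoped executor.

```python
from typing import Callable, List

RISKY_ACTIONS = {"delete_records", "send_payment"}  # hypothetical policy


def run_action(
    action: str,
    approve: Callable[[str], bool],
    execute: Callable[[str], str],
    log: List[str],
) -> str:
    """Propose, check policy, ask a human if risky, execute, and log."""
    log.append(f"proposed:{action}")
    if action in RISKY_ACTIONS:
        # Only risky actions pause for a human decision.
        if not approve(action):
            log.append(f"rejected:{action}")
            return "rejected"
        log.append(f"approved:{action}")
    result = execute(action)
    log.append(f"executed:{action}:{result}")
    return result


audit_log: List[str] = []
outcome = run_action(
    "delete_records",
    approve=lambda a: False,   # reviewer says no
    execute=lambda a: "done",
    log=audit_log,
)
```

Because the approval and logging logic lives in one place, new automations can reuse it by passing in their own `execute` callable.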

This is the kind of pattern that teamcopilot.ai is built for. Teams can reuse a workflow once it has the right guardrails instead of rebuilding the same approval step over and over.

If you want a concrete example of why this matters, An AI Coding Agent Deleted a Production Database. Here's What Happened and How to Prevent It is a good reminder that fast automation without control can become expensive very quickly.

What this means for product teams

For product teams, HITL must be integrated directly into the core user experience.

If the approval step is too noisy, people ignore it. If the permissions are too broad, security pushes back. If the audit trail is weak, nobody trusts the workflow after the first incident. The best systems balance those three things so the workflow still feels fast.

That usually means building for three types of actions:

low-risk actions that can run automatically
medium-risk tasks that require a quick manual check
high-risk operations that always demand explicit, multi-step approval

Once you draw that line, the design gets much clearer.
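A minimal sketch of that three-tier split might look like this. The action names and the number of approvals per tier are assumptions for illustration. In particular, treating "multi-step approval" as two sign-offs is just one possible reading.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical mapping from action name to risk tier.
ACTION_RISK = {
    "draft_reply": Risk.LOW,
    "update_ticket": Risk.MEDIUM,
    "rotate_credentials": Risk.HIGH,
}

# Assumed number of human sign-offs required for each tier.
REQUIRED_APPROVALS = {Risk.LOW: 0, Risk.MEDIUM: 1, Risk.HIGH: 2}


def approvals_needed(action: str) -> int:
    """Unknown actions fall into the highest tier rather than the lowest."""
    return REQUIRED_APPROVALS[ACTION_RISK.get(action, Risk.HIGH)]
```

Defaulting unclassified actions to the high tier means a new capability has to be reviewed before it can run automatically.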

Where this leaves the market

The industry is moving toward autonomous systems, but success belongs to systems with well-defined limits.

Governance-heavy discussions keep showing up across the market. Teams want the speed of AI, but they also want the ability to explain what happened when something goes wrong. Human-in-the-loop design is the bridge between those two needs.

FAQ

What is human-in-the-loop AI?

It is an AI setup where a person reviews or approves important actions before the system executes them.

Is human-in-the-loop AI the same as human oversight?

Not exactly. Human oversight is the broader idea. Human-in-the-loop is the workflow pattern that puts the human directly into the decision path.

Do all AI actions need human approval?

No. Low-risk actions can often run automatically. The point is to reserve human review for actions that are risky, irreversible, or sensitive.

What kinds of actions should usually require approval?

Anything that changes production systems, moves money, sends external messages, grants access, or exposes sensitive data.

Why are permissions important if I already have approvals?

Because approvals do not help much if the agent already has too much access. Permissions should shrink the blast radius before the approval step even starts.

What should an audit trail include for AI agents?

It should record the agent, the reviewer, the requested action, the policy used, the time, the result, and any override or escalation.

Can human-in-the-loop slow teams down?

It can, if the workflow is poorly designed. A good HITL system reduces friction by making the review step small, specific, and easy to act on.

How is teamcopilot.ai relevant here?

It gives teams a way to run reusable AI workflows with permissions, approvals, and control instead of treating every agent like an unbounded assistant.

What is the biggest mistake teams make with agent approvals?

They make the approval step too vague and let the agent keep too much access. That creates noise for reviewers and risk for the system.

What is the simplest way to start?

Start with one workflow, one risky action, and one clear approval step. Get the logging right, then expand from there.

Claude in Slack Explained: What Claude Tag Can Do, Benefits, and Downsides

Rishabh Poddar — Wed, 24 Jun 2026 04:32:07 +0000

Anthropic's Claude Tag looks simple at first glance. Put Claude inside Slack, let people tag it into threads, give it access to selected tools and data, and let it work in the same place the team is already talking.

The simplicity is a bit deceptive. Once an AI agent becomes a shared teammate instead of a private chat, the questions get much sharper. What can it see? Who can ask it to do work? How much memory does it keep? How do you stop it from becoming noisy, expensive, or hard to control?

This post walks through what Claude Tag does, where it is genuinely useful, where it starts to fray, and why a more model-agnostic workflow layer like teamcopilot.ai can be a better fit for teams that want tighter control.

What Claude Tag is

Claude Tag is Anthropic's Slack-native agent. According to Anthropic's announcement, you can tag @Claude into a thread, give it access to the tools and data it needs, and let it work on behalf of the channel.

Claude Tag acts as a shared presence directly inside your Slack channels. Everyone in the channel can monitor its progress, jump into the thread, and rely on the agent to maintain context over time.

Anthropic's docs make the positioning even clearer. Claude Tag is meant to catch up on messy threads, pull numbers, draft PRs, prep for calls, watch channels, and keep work moving without forcing people to switch tabs.

What it can do well

1. Work where the conversation already happens

This is the main win. Most teams already decide things in Slack. The problem is that the decision, the follow-up, the doc, and the action item end up scattered across different tools.

Claude Tag tries to close that gap. If a thread turns into a task, you can hand the task to Claude in the same place you discussed it. That cuts out the usual copy-and-paste dance.

2. Keep shared context in public view

The multiplayer part matters. A shared agent in a channel can be easier to use than a private agent hidden in one person's account because the whole team can see what was asked, what Claude did, and what is still open.

It also makes handoffs less painful. If one person leaves for the day, another person can pick up the same thread without starting over.

3. Handle repetitive coordination work

Claude Tag is strongest when the task is not deeply bespoke. Think summaries, status pulls, ticket drafting, call prep, channel monitoring, or chasing down a missing detail.

That is the kind of work teams usually tolerate in the background and never quite automate properly.

4. Add proactive behavior

Anthropic leans hard into ambient and asynchronous work here. Claude can watch, follow up, and surface things that went quiet.

That is useful when the work is more like coordination than code. It is not just answering questions. It is nudging the team forward.

Where it gets awkward

1. Slack is a constraint, not just a feature

Slack is where many teams work, but not all teams. And even for teams that do use Slack heavily, it is still only one surface.

If your work spans Slack, GitHub, docs, internal tools, and approvals, a Slack-only agent can feel like the front door to a much larger system that it does not really control.

2. Shared memory is useful and risky

Memory is a benefit until it becomes stale, noisy, or wrong.

The HN thread around the launch went straight to the obvious concerns: token usage, memory bloat, permissions, and whether a shared Slack agent can really know what should or should not be remembered. That is the right criticism. Team memory is only helpful if teams can control what gets retained and what gets ignored.

3. Permissions get complicated fast

Anthropic has a thoughtful access model for Claude Tag, including channel-scoped identities and admin-controlled access. That is better than a naive shared bot.

But the moment an agent sits in a shared channel, permissions stop being abstract. The agent has to know whose tools it can use, what data it can read, what gets logged, and what requires a human to approve.

For a lot of companies, that becomes the product.

4. Token cost is a real concern

Running a proactive, memory-heavy agent in a busy channel gets expensive quickly, because every summary and follow-up consumes tokens. The busier the channel, the faster the bill grows. Agent design is also cost design.
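A back-of-the-envelope estimate makes the point concrete. The function below is a sketch, and every input is a placeholder: the event count, the tokens per event, and the price per million tokens are made-up numbers, not Anthropic's pricing or measured usage.

```python
def monthly_token_cost(
    events_per_day: int,
    tokens_per_event: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Rough monthly spend for agent activity in one channel."""
    total_tokens = events_per_day * tokens_per_event * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: 200 summaries or follow-ups a day, 3,000 tokens each,
# at a made-up blended price of $5 per million tokens.
estimate = monthly_token_cost(200, 3_000, 5.0)  # 18M tokens -> 90.0 USD
```

The useful part is the shape of the formula: cost scales linearly with how often the agent acts and how much context each action carries.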

5. It can feel too tied to one vendor and one model

Using Claude Tag also ties you directly to Anthropic's ecosystem, which limits your ability to swap models or use different tools as your needs change.

Where teamcopilot.ai fits

teamcopilot.ai centers the workflow rather than the chat surface, giving you direct control over what runs, which tools the agent can touch, and when a human must step in. This approach makes it easier to stay transparent about what the agent is actually doing, not just what it said it would do.

It is also model-agnostic, which matters more over time than people like to admit. The best model today is not guaranteed to be the best model for every task next quarter. If your workflow layer is separate from the model layer, you keep more flexibility and less lock-in.

For teams fully committed to Anthropic's ecosystem who want a quick, Slack-native assistant, Claude Tag is a strong fit. If you need to control the underlying workflow, maintain deep transparency, and avoid vendor lock-in, teamcopilot.ai is a better choice.

A practical read on the launch

Claude Tag is not a gimmick. It is a serious attempt to make AI feel like a teammate instead of a tab.

That makes it interesting, but it also highlights why the limitations matter.

Once an agent becomes multiplayer, the hard problems show up faster. Memory, permissions, and auditing all become much more difficult. And if the agent is buried inside one chat app, the lock-in question becomes impossible to ignore. None of this makes Claude Tag bad. It is just an honest account of the tradeoffs.

If your team lives in Slack and wants a fast way to delegate work, it is worth trying. If your team needs more control than that, a workflow-first system like teamcopilot.ai is probably the better long-term bet.

FAQ

Is Claude Tag the same as Claude in Slack?

Basically yes. Claude Tag is Anthropic's newer Slack-native way to bring Claude into a team channel as a shared agent.

What is the main benefit of Claude Tag?

It keeps work inside the thread where the conversation already happened. That makes it easier to assign tasks, get summaries, and keep context visible to the whole team.

What can Claude Tag actually do?

It can summarize threads, pull data, watch channels, draft responses, prepare call notes, open PRs, and generally handle the coordination work that usually gets lost between messages.

Is Claude Tag only for engineers?

No. Anthropic is clearly aiming at broader team use. Support, ops, product, sales, and admin workflows all fit the pattern if the work lives in Slack.

What are the biggest downsides?

The biggest ones are Slack lock-in, token cost, permission complexity, and the risk of letting a shared agent remember too much from too many threads.

Is Claude Tag safe for sensitive company data?

It is safer than a loose chatbot because Anthropic built admin-scoped identities and access controls around it. But safety still depends on how carefully the workspace is configured and what data you expose to the channel.

Why do people worry about token usage?

Because a proactive, memory-heavy agent can generate a lot of traffic in a busy workspace. Every extra summary, follow-up, and context refresh costs tokens, so the real bill depends on how the team uses it.

Could Claude Tag replace a workflow platform?

Not really. It is best thought of as a powerful interaction layer. A workflow platform handles more of the orchestration, approvals, branching logic, and auditability behind the scenes.

When should I choose teamcopilot.ai instead?

Choose teamcopilot.ai if you want the agent to run controlled workflows across tools, stay model-agnostic, and make approvals and execution paths more explicit.

Who should choose Claude Tag versus teamcopilot.ai?

Teams deeply embedded in Slack who want a fast, collaborative assistant for daily coordination will get the most out of Claude Tag. On the other hand, teams that need reusable automations, strict governance, and independence from a single chat interface will find teamcopilot.ai a better fit.

Should I use both?

Sometimes, yes. Claude Tag can be the front door for quick team interaction, while teamcopilot.ai handles the more controlled automation behind it.

What should I watch out for before rolling out a tool like this?

Start with access, logging, and approval paths. If you cannot explain what the agent can touch, who can invoke it, and how to review its actions, you are not ready to scale it.

How should a team make the final decision?

Claude Tag is a real step toward shared, multiplayer AI work. It is useful. It is also opinionated. If that fits your team, great. If not, teamcopilot.ai gives you a cleaner way to keep the model separate from the workflow and the workflow separate from the chat surface.

MCP vs Skills: Why Skills Save Context Tokens

Rishabh Poddar — Mon, 22 Jun 2026 09:54:11 +0000

MCP is useful, but most of the time you do not actually need it. It gives an agent a clean way to discover tools, call APIs, and work with external systems. In practice, a skill file can describe the same usage path without dragging the whole MCP surface into context.

But MCP is not free. The real issue is not MCP itself but the habit of loading a big MCP surface into every session, no matter what the session is actually about. Once a Claude Code or Codex run pulls in a bunch of servers, the model sees those tool definitions right away, even if the job is just writing docs or fixing a small bug. That is where the waste starts.

The hidden cost of always-on MCP

Every MCP server brings metadata with it: tool names, descriptions, argument schemas, nested parameters, enums, examples, and sometimes prompts or resources. All of that is useful, but all of it also occupies context.

If you connect a handful of lightweight tools, the overhead is annoying but manageable. If you connect a real stack of services, the cost compounds fast.

In practice, you end up paying for:

tool discovery before the task starts
schema text the model may never use
repeated loading across unrelated sessions
extra context pressure that pushes out the actual work

That last point matters more than people think. Context acts as the active working set the model uses to reason. The more of it you burn on static tool catalogs, the less room you have for the user request, the repo state, prior reasoning, and the actual answer.

Anthropic has already written about this problem directly in the context of MCP. Their engineering post on code execution with MCP calls out tool-definition bloat and shows how direct tool calls can consume a lot of context before the model even starts doing the real job. The tool list is not just setup noise; it is part of the session cost.
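To get a feel for the overhead, you can serialize the tool definitions and estimate their size. This is a rough sketch only. The four-characters-per-token rule is a crude heuristic rather than a real tokenizer, and the example tool is invented, although the `name`, `description`, and `inputSchema` keys follow the general shape of an MCP tool definition.

```python
import json


def estimate_tokens(text: str) -> int:
    # Crude heuristic: about 4 characters per token. Real tokenizers differ.
    return len(text) // 4


def catalog_overhead(tools: list) -> int:
    """Estimate tokens spent on tool definitions before any work starts."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


# One made-up tool definition, repeated to mimic a 50-tool server.
example_tool = {
    "name": "update_record",
    "description": "Update a record in the CRM by id.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "id": {"type": "string"},
            "fields": {"type": "object"},
        },
        "required": ["id", "fields"],
    },
}
overhead = catalog_overhead([example_tool] * 50)
```

Real tool definitions are often much longer than this one, so treat the result as a lower bound rather than a measurement.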

Why skills are cheaper

Skills take a different path. A skill file keeps the always-loaded portion tiny. Usually that means just the skill name and a short description in the frontmatter. The detailed instructions stay in SKILL.md and only load when the model actually needs them. This progressive disclosure is the whole trick:

The model sees a lightweight skill name and description up front.
If the task matches, it loads the skill file.
If the skill needs supporting files, those are read only when needed.

For repeated operational knowledge, that is a much better tradeoff than dumping a full MCP tool surface into every session. You get the guidance when it matters, and you do not spend tokens on it when it does not.
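The three steps above can be sketched in a few lines of Python. This is an illustration of the idea, not how any particular agent implements it. The `SkillStub` class and both helpers are hypothetical, and the frontmatter parser is deliberately naive: it only handles simple `key: value` lines between two `---` markers.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass
class SkillStub:
    name: str
    description: str
    path: Path  # location of the full SKILL.md


def read_stub(skill_md: Path) -> SkillStub:
    """Read only the frontmatter (between the first two '---' lines)."""
    meta: Dict[str, str] = {}
    with skill_md.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            raise ValueError(f"{skill_md} has no frontmatter")
        for line in fh:
            if line.strip() == "---":
                break  # stop before the body, so it is never read here
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return SkillStub(meta["name"], meta["description"], skill_md)


def load_body(stub: SkillStub) -> str:
    """Load the full instructions only once the task matches the skill."""
    text = stub.path.read_text(encoding="utf-8")
    return text.split("---", 2)[2].lstrip()
```

Only the output of `read_stub` would sit in context for every session. `load_body` runs when the task actually calls for the skill.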

This is why skills are a better default for:

team-specific procedures
prompt templates
review checklists
internal conventions
reusable task instructions
“how we do this here” knowledge

They are not trying to be live integrations. They are trying to be cheap, reusable context.

Skills can replace the MCP layer

Skills are for instructions, decision-making, and the actual usage pattern, while MCP is usually just extra protocol surface. In practice, that means skills can replace MCP for the part humans actually interact with. The model does not need a full tool catalog in context just to know how to use a service.

If the agent needs to use a database, hit a SaaS API, or make authenticated requests in real time, the skill can still describe the flow clearly and keep the model on the narrow path it needs.

If the agent just needs to know how your team wants it to behave, a skill is the better shape. Most of the time, that is the whole job.

The mistake is to keep a heavy protocol layer around when a skill file can do the same job with far less context.

A simple rule

Use skills by default.

Treat MCP as optional, not foundational.

That sounds obvious, but a lot of agent setups blur the line. They stuff every possible tool into every session, then wonder why the model gets slower, more expensive, and harder to steer.

What this looks like in practice

If you have a service that exposes 40 or 50 MCP tools, it might be fine for a developer who uses it every day. But most sessions do not need all 50 tools. A lot of the time, the agent just needs one narrow procedure, such as looking up a user, updating a record, creating a ticket, or formatting a request safely.

The skill can tell the model exactly how to handle the task, what fields matter, what not to do, and which edge cases to watch for. The model does not need a giant always-on MCP tool catalog to do that well.

That is the real token saving. You stop paying for the full runtime surface when all you needed was the operating playbook.

How to convert MCP into a skill

If you have an MCP server that mostly behaves like a reusable API wrapper, you should turn the useful parts into a skill.

The easiest way to inspect what you actually need is to use the MCPViewer tool.

Here is the workflow:

Open the MCPViewer tool.
Paste the MCP server URL.
Click Analyze.
Scroll down and click Download spec.
Copy the downloaded JSON.
Paste it into a SKILL.md file as the skill’s content reference.
Set the skill description to something like How to use APIs for <service name> service.

This flow extracts the useful service knowledge into a lighter, reusable skill that the model can load only when needed, rather than trying to preserve every tool forever.
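The last three steps can also be scripted. The sketch below assumes the downloaded spec is a JSON file and treats its contents as opaque, since the exact format MCPViewer produces is not described here. The function names, the `-api` suffix on the skill name, and the heading text are all choices made for this example.

```python
import json
from pathlib import Path


def build_skill_md(service_name: str, spec: dict) -> str:
    """Wrap a downloaded spec in a SKILL.md with minimal frontmatter."""
    slug = service_name.lower().replace(" ", "-")
    fence = "`" * 3  # built at runtime to avoid nesting code fences
    return (
        "---\n"
        f"name: {slug}-api\n"
        f"description: How to use APIs for {service_name} service\n"
        "---\n\n"
        f"# {service_name} API reference\n\n"
        f"{fence}json\n{json.dumps(spec, indent=2)}\n{fence}\n"
    )


def write_skill(service_name: str, spec_path: Path, out_dir: Path) -> Path:
    """Read the spec JSON from disk and write SKILL.md into out_dir."""
    spec = json.loads(spec_path.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / "SKILL.md"
    target.write_text(build_skill_md(service_name, spec), encoding="utf-8")
    return target
```

In practice you would likely trim the spec down to the tools you use before saving it, so the skill body stays small once it is loaded.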

If the service changes often, keep the skill narrow and update it when the API changes. If the service is stable, the skill becomes a better long-term home for the instructions than the full MCP surface.

A good pattern for teams

For most teams, the best setup is skills everywhere, using skill files for the things that must be remembered:

how to format requests
how to review output
team conventions
approval rules
safe operating procedures

If a service still needs live execution, the skill can describe that path without dragging its whole protocol surface into every session. This keeps the agent lean and makes the system easier to maintain, because procedural knowledge is no longer spread across a large tool registry.

It is also easier to reason about failure. If the skill is wrong, you update instructions. If you need to change how a service is used, you update the skill. Those are different jobs, and it helps to keep them separate.

The real goal: less context waste

The problem is not just token cost in the billing sense. It is context waste. Every extra tool definition you stuff into a session is one more thing the model has to carry around while solving the actual task.

Skills let you defer that cost until the model really needs the information. They are a good fit for repeated workflows, company knowledge, and reusable operating rules.

If MCP is the transport, skills are the memory.

FAQ

Is MCP bad?

MCP is not the main problem. The problem is loading it into sessions that do not need it when a skill file would do the job with far less context.

Do skills replace MCP?

Yes, for most practical cases. If the goal is to teach the agent how to use a service, a skill can replace MCP and keep the context much smaller.

Why do skills save tokens?

Because the always-loaded part is small. The model sees the skill name and description first, then loads the full SKILL.md only when the skill is relevant.

What kind of content belongs in a skill?

Reusable instructions, procedures, checklists, formatting rules, and team-specific guidance. If the content is mostly about how to behave, it belongs in a skill.

What kind of content belongs in MCP?

Very little, unless you have a special case. The same operational knowledge usually fits better in a skill.

Can I keep both MCP and skills for the same service?

Yes. That is often the best setup. MCP handles the runtime connection. The skill handles the playbook for using it well.

Why use mcpview.teamcopilot.ai?

Because it lets you inspect the actual MCP surface before you decide what should stay as MCP and what should become a lighter skill. That makes the conversion less guessy.

What if the MCP spec changes?

Update the skill the same way you would update any other documentation or wrapper. If the API changes often, keep the skill narrow so maintenance stays easy.

What is the best short description for a converted skill?

Something specific and boring, such as "How to use APIs for <service name> service". This pattern tells the model exactly what the skill is for without wasting words.

Sakana AI's Fugu Explained: How the Multi-Agent Model Orchestrates Frontier LLMs

Rishabh Poddar — Mon, 22 Jun 2026 05:23:12 +0000

Sakana AI's Fugu is a good example of where the industry is heading.

Instead of trying to win with one massive model, it coordinates a pool of strong models well. On the surface, Fugu is presented as a single API, but under the hood, it behaves like a learned manager that routes tasks, chooses roles, and stitches together the output of multiple frontier models. This makes Fugu a multi-agent orchestration system delivered as a single model, rather than just a chatbot with a nicer prompt.

A lot of the messy work in production AI comes from orchestration: choosing the right model, deciding when to verify, splitting a task into subtasks, and avoiding expensive calls when a cheaper one will do. Fugu turns that problem into the product.

What Fugu actually is

Sakana AI describes Fugu as a multi-agent system as a model. You send one request to a single endpoint, and Fugu decides how to distribute the work across a pool of specialist models.

That pool is not locked to a single vendor. The system can dynamically assemble agents, coordinate them, and even let users opt out of specific models or providers to fit privacy, data, or compliance requirements. The goal is to keep the API simple while making the backend coordination much smarter than a hand-built router.
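
To make the opt-out idea concrete, here is a minimal sketch of filtering an agent pool before any routing happens. The article does not describe how Fugu exposes this setting, so the `Agent` type, the `eligible_agents` function, and the model and provider names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Agent:
    model: str
    provider: str


def eligible_agents(
    pool: List[Agent],
    excluded_providers: FrozenSet[str] = frozenset(),
    excluded_models: FrozenSet[str] = frozenset(),
) -> List[Agent]:
    """Drop every agent whose provider or model has been opted out."""
    return [
        agent for agent in pool
        if agent.provider not in excluded_providers
        and agent.model not in excluded_models
    ]


# Placeholder pool; these names do not refer to real models or vendors.
pool = [
    Agent("model-a", "provider-x"),
    Agent("model-b", "provider-y"),
    Agent("model-c", "provider-y"),
]

allowed = eligible_agents(
    pool,
    excluded_providers=frozenset({"provider-x"}),
    excluded_models=frozenset({"model-c"}),
)
```

The point of filtering first is that a compliance rule is applied once, up front, and whatever does the routing only ever sees agents that are allowed to participate.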

There are two public variants:

Fugu, which balances latency and quality
Fugu Ultra, which uses a deeper pool of agents for harder tasks

This split is useful because not every task deserves the most expensive path. A lot of day-to-day coding, review, and internal support work needs a fast default. More difficult tasks, like deep reasoning, paper reproduction, or security analysis, can justify a heavier orchestration setup.

How it works

The basic workflow is different from a normal single-model call. First, the incoming task is routed into a learned coordination process. Fugu decides which agents should participate, what role each one should play, and how the exchange should proceed. The system learns collaboration patterns that are not obvious to a human operator, but work well in practice.

Fugu is grounded in two ICLR 2026 papers: TRINITY and Conductor. TRINITY uses a lightweight evolved coordinator that assigns roles like Thinker, Worker, and Verifier across a multi-turn task. Conductor learns natural-language coordination strategies with reinforcement learning. Together, they show that instead of hand-designing every workflow, you can train a system to discover how to orchestrate other models. This points to a broader shift: while the last wave of AI progress focused on making single models stronger, this wave is about making model systems smarter.
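
A toy version of role assignment across turns might look like the sketch below. It is not TRINITY or Conductor: the coordinator here is a fixed rotation standing in for a learned one, the agents are stub functions rather than real models, and every name (`choose_role`, `run_task`, `model-a`, `model-b`) is invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# An agent takes (role, text) and returns its output for this turn.
Agent = Callable[[str, str], str]

ROLES = ("Thinker", "Worker", "Verifier")


def stub_agent(name: str) -> Agent:
    """Stand-in for a real model call; it only labels the text it receives."""
    def run(role: str, text: str) -> str:
        return f"[{name} as {role}] {text}"
    return run


def choose_role(turn: int) -> str:
    """Fixed rotation; a learned coordinator would decide this instead."""
    return ROLES[turn % len(ROLES)]


def run_task(
    task: str,
    agents: Dict[str, Agent],
    assignment: Dict[str, str],
    turns: int = 3,
) -> List[Tuple[str, str, str]]:
    """Pass the task through one agent per turn, recording who played which role."""
    transcript: List[Tuple[str, str, str]] = []
    current = task
    for turn in range(turns):
        role = choose_role(turn)
        agent_name = assignment[role]
        current = agents[agent_name](role, current)
        transcript.append((role, agent_name, current))
    return transcript


agents = {"model-a": stub_agent("model-a"), "model-b": stub_agent("model-b")}
assignment = {"Thinker": "model-a", "Worker": "model-b", "Verifier": "model-a"}

transcript = run_task("Fix the failing test", agents, assignment)
```

In a learned system, both `choose_role` and `assignment` would be outputs of the coordinator rather than hand-written rules, which is the part the research described above is about.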

Why the orchestration layer matters

Most teams already know that different models are good at different things. While one model might excel at code, others are better suited for long reasoning or factual retrieval. In a hand-built stack, someone has to decide when to call which model, how to verify the output, and when to stop paying for more inference. Fugu tries to learn those decisions instead of hard-coding them.

This approach can improve cost-performance. If the system can route easy subtasks to lighter agents and reserve heavier agents for the hard parts, the overall result can be better than sending every request to the most expensive model in the pool.
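
The cost argument is easy to see with a small hand-built router. This sketch uses a fixed difficulty threshold, which is exactly the kind of hard-coded rule a learned orchestrator is meant to replace. The tiers, costs, and difficulty scores are made-up numbers for illustration only, not Sakana's pricing or routing logic.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # hypothetical cost units


LIGHT = Tier("light", 1.0)
HEAVY = Tier("heavy", 10.0)


def route(difficulty: float, threshold: float = 0.7) -> Tier:
    """Send a subtask to the heavy tier only when it looks hard enough."""
    return HEAVY if difficulty >= threshold else LIGHT


def total_cost(difficulties: Iterable[float], router: Callable[[float], Tier]) -> float:
    return sum(router(d).cost_per_call for d in difficulties)


# Estimated difficulty of five subtasks, on a 0 to 1 scale (illustrative values).
subtasks = [0.1, 0.2, 0.9, 0.3, 0.8]

routed_cost = total_cost(subtasks, route)                   # three light, two heavy
always_heavy_cost = total_cost(subtasks, lambda d: HEAVY)   # five heavy calls
```

With these numbers the routed plan costs 23 units against 50 for the always-heavy plan. The saving depends entirely on how well difficulty is estimated, which is the part that is hard to hand-tune.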

It can also improve reliability. A lot of failures in agentic systems happen because orchestration is brittle. When one model does everything, a single mistake ripples through the whole chain. Fugu's design aims to reduce that risk by using specialists and verification roles more deliberately.

Fugu versus Fugu Ultra

The difference between the two variants is mostly about how much orchestration you want to pay for.

Fugu is the balanced option, designed as the practical default for coding, interactive work, and general workloads where latency still matters.

Fugu Ultra goes further, with Sakana positioning it for more complex, high-stakes, multi-step work where answer quality matters more than speed. The examples they highlight include paper reproduction, Kaggle competitions, security analysis, literature review, and patent research.

This framing shows what the product is really for. Fugu is not just a better chat model; it is a system for tasks where the model has to reason, delegate, verify, and even disagree with itself before it answers.

What the benchmarks suggest

Sakana reports strong performance across coding, reasoning, science, and agentic benchmarks. Fugu and Fugu Ultra compare well with publicly available frontier models, sometimes sitting right alongside or ahead of them.

The benchmarks they call out include:

SWE-Pro for coding
TerminalBench for terminal and tool use
LiveCodeBench and LiveCodeBench Pro
Humanity's Last Exam for hard reasoning
GPQA-D for scientific reasoning
SciCode
MRCRv2 for long-context reasoning

The exact numbers matter less than the pattern. Rather than claiming to be a single monolithic model, Fugu demonstrates that orchestration itself can produce frontier-level results on difficult tasks.

Their qualitative examples make that point even more clearly. Sakana shows Fugu on tasks like autonomous research, classical Japanese reading-order recovery, Rubik's Cube solving, CAD generation for a mechanical iris, blindfold chess, and trading simulations. These environments are very different, but they all reward a system that can choose the right internal strategy instead of guessing once and hoping for the best.

The product details that matter

Fugu is delivered through an OpenAI-compatible API, which means teams do not need to rebuild their integration layer to try it. If you already have a client, a harness, or an internal agent stack that talks to an OpenAI-style endpoint, Fugu slots in without much friction.
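
In practice, "OpenAI-compatible" usually means the endpoint accepts the familiar chat completions request shape. The sketch below builds such a request with the standard library and does not send it. The base URL, API key, and the model name `fugu` are placeholders assumed for illustration; check Sakana's documentation for the real values.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, api_key: str, model: str, prompt: str
) -> urllib.request.Request:
    """Build an OpenAI-style chat completions request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: the URL, key, and model name are assumptions, not real settings.
request = build_chat_request(
    base_url="https://api.example.invalid/v1",
    api_key="PLACEHOLDER_KEY",
    model="fugu",
    prompt="Summarize this diff.",
)
```

If an existing client already builds requests like this, switching providers should mostly come down to changing the base URL, the key, and the model name.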

Sakana offers both subscription and pay-as-you-go plans. The pay-as-you-go model avoids stacking fees across every model in the pool; you pay a single rate based on the top-tier model involved in the configured pool. This makes orchestration financially viable instead of prohibitively expensive.
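
As a rough illustration of why that matters, compare a stacked fee with a single top-tier rate. This is one reading of the pricing description above, not Sakana's published formula, and the per-model rates are invented numbers.

```python
from typing import Dict


def stacked_rate(rates: Dict[str, float]) -> float:
    """What you would pay if every model in the pool billed separately."""
    return sum(rates.values())


def single_top_tier_rate(rates: Dict[str, float]) -> float:
    """One rate, set by the most expensive model in the configured pool."""
    return max(rates.values())


# Hypothetical per-request rates for three pooled models.
rates = {"model-a": 2.0, "model-b": 5.0, "model-c": 12.0}

stacked = stacked_rate(rates)          # 19.0
single = single_top_tier_rate(rates)   # 12.0
```

Under these assumed numbers, the single-rate model charges 12 units where stacked billing would charge 19, and the gap grows as more models join the pool.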

One limitation: Fugu is not yet available in the EU/EEA while Sakana works toward compliance.

Why this is a bigger product than it looks like

At first glance, Fugu sounds like a very good router, but that description undersells it. The deeper idea is that model orchestration itself is becoming a first-class capability. If that holds, the value is not only in better benchmark scores, but in turning a pile of expensive, specialized models into a single system that a team can use without hand-tuning workflows from scratch.

The system is useful for real teams because it hides just enough complexity to make multi-model workflows practical.

There is also a strategic angle. Relying on one provider for every critical task is a risk. A learned orchestration layer that can route around constraints, swap agents, or exclude a provider reduces that dependency. Sakana is clearly leaning into that idea.

Where teamcopilot.ai fits

teamcopilot.ai is a shared control layer for AI workflows, permissions, and approvals. That makes it a natural fit for a system like Fugu. If Fugu is the orchestration engine for a task, teamcopilot.ai is the governance layer around it. You can route work through reusable workflows, keep approvals visible, and decide who can do what before the model ever touches the task. Production AI requires making models safe, repeatable, and shareable across a team.

The tradeoffs

Fugu is impressive, but it has tradeoffs. Latency will always be part of the conversation when a system calls into multiple models or multiple agent steps. If you need instant responses for a live UI, a simpler single-model path may still win.

The routing logic is also proprietary. Sakana does not expose the exact internal selection process, so you get the benefits of orchestration without full visibility into every decision. Additionally, while the standard Fugu allows opt-outs, Fugu Ultra uses the full agent pool. If you need strict control over every provider in the loop, that is worth keeping in mind.

Still, these are normal tradeoffs for a new product category. The real test is whether the system earns that complexity back with better results.

The bigger takeaway

Fugu is a sign that the market is moving from single-model thinking to system thinking. That change is easy to miss if you only look at raw benchmark numbers, but the product story is clear. Sakana AI is betting that the most useful AI systems will be coordinated pools of models, with a learned layer deciding how to use them. Many teams are already heading in this direction manually, and Fugu simply makes the orchestration layer explicit.

FAQ

What is Sakana Fugu?

Sakana Fugu is a multi-agent orchestration system presented as a single model API. It coordinates a pool of frontier models instead of relying on one model to do everything.

Is Fugu a model or a product?

It is both. Sakana exposes it as a model API, but the real value is in the orchestration system behind it.

What is the difference between Fugu and Fugu Ultra?

Fugu is the balanced, lower-latency option. Fugu Ultra uses a deeper agent pool for harder, higher-stakes tasks where quality matters more than speed.

How does Fugu work?

It routes tasks across multiple specialist models, assigns roles, and coordinates the response. The research behind it comes from TRINITY and Conductor.

Why not just call one frontier model directly?

Because different models excel at different tasks. Fugu decides when to delegate, verify, or switch strategies instead of making one model carry the whole load.

Can I control which models Fugu uses?

Yes, for Fugu. Sakana lets you opt out of specific models or providers to fit privacy, data, or compliance needs. Fugu Ultra uses the full pool.

Is Fugu OpenAI-compatible?

Yes. It fits into existing clients and agent stacks without requiring a major integration rewrite.

What tasks is Fugu best for?

Coding, reasoning, research, security analysis, paper reproduction, and other multi-step workflows where orchestration matters.

Is Fugu good for real-time apps?

Not necessarily. The more agents you coordinate, the more latency becomes a factor, so it may not be ideal for instant responses.

Does Fugu show which underlying models it used?

No. Sakana treats the exact routing logic as proprietary.

Can teams use Fugu safely?

Yes, if the surrounding workflow is controlled. Approval layers, audit trails, and secret handling are essential for making any model safe and useful in a team setting.

Why should teams care about orchestration at all?

Because orchestration is where real productivity wins happen. Choosing the right model for the right subtask can matter as much as choosing the model itself.

Where does teamcopilot.ai fit in?

teamcopilot.ai provides a shared control layer for AI workflows, permissions, and approvals, making it easy to run systems like Fugu inside a governed, reusable process.

Will Fugu replace single-model workflows?

Not entirely. Simple tasks are still better served by a single call, but harder workflows that benefit from delegation and verification will increasingly rely on systems like Fugu.
