<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Drishti Shah</title>
    <description>The latest articles on DEV Community by Drishti Shah (@drishti_portkey).</description>
    <link>https://dev.to/drishti_portkey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3286404%2F4652e70d-893b-46f9-b2dc-2a4d308f0e63.png</url>
      <title>DEV Community: Drishti Shah</title>
      <link>https://dev.to/drishti_portkey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/drishti_portkey"/>
    <language>en</language>
    <item>
      <title>n8n Best Practices</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Tue, 21 Apr 2026 23:20:50 +0000</pubDate>
      <link>https://dev.to/portkey/n8n-best-practices-1284</link>
      <guid>https://dev.to/portkey/n8n-best-practices-1284</guid>
<description>&lt;p&gt;n8n works reliably in a single-developer setup. The picture changes once AI Agent nodes begin calling LLMs across shared teams and environments.&lt;/p&gt;

&lt;p&gt;Token usage is hard to track, provider keys are spread across credentials, and workflows depend on a single model endpoint. The gap is not in workflow design. It sits between n8n and the model providers those workflows depend on.&lt;/p&gt;

&lt;p&gt;This guide covers the n8n best practices needed to run AI workflows in production, focusing on the operational control layer behind LLM calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  What n8n is and why teams choose it for AI automation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://n8n.io/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;n8n&lt;/u&gt;&lt;/a&gt; is a node-based workflow automation platform that combines a visual builder with the option to write JavaScript or Python anywhere in a workflow.&lt;/p&gt;

&lt;p&gt;That flexibility is why it shows up quickly in production AI stacks and why many discussions about n8n best practices now include agent orchestration and execution in the same system. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For AI workflows, n8n provides:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A native AI Agent node built on LangChain that connects workflows to OpenAI, Anthropic, Gemini, Azure OpenAI, and others.&lt;/li&gt;
&lt;li&gt;Sub-workflows for modular agent design across environments and teams.&lt;/li&gt;
&lt;li&gt;Memory nodes for stateful, multi-step interactions.&lt;/li&gt;
&lt;li&gt;Tool nodes for external API calls during agent execution.&lt;/li&gt;
&lt;li&gt;An MCP Trigger node that allows workflows to be invoked from AI environments like &lt;a href="https://portkey.ai/blog/claude-code-best-practices-for-enterprise-teams/" rel="noopener noreferrer"&gt;&lt;u&gt;Claude Desktop&lt;/u&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Access and pricing depend on deployment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Deployment option&lt;/th&gt;
&lt;th&gt;What teams get&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;n8n Cloud&lt;/td&gt;
&lt;td&gt;Starter, Pro, Business, Enterprise tiers billed per workflow execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Free, with infrastructure and security managed by the team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business plan&lt;/td&gt;
&lt;td&gt;Adds SSO, Git-based version control, RBAC, isolated environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise plan&lt;/td&gt;
&lt;td&gt;Adds compliance support and dedicated assistance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Even well-structured workflows with retries, environments, and version control do not solve token visibility, credential management, or provider reliability. Those issues sit outside the workflow layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where n8n’s Native Tooling Falls Short for LLM Operations at Scale
&lt;/h2&gt;

&lt;p&gt;n8n handles workflow orchestration well. The gaps appear when &lt;a href="https://portkey.ai/blog/what-are-ai-agents/" rel="noopener noreferrer"&gt;&lt;u&gt;AI Agent&lt;/u&gt;&lt;/a&gt; nodes begin calling LLMs across teams and environments.&lt;/p&gt;

&lt;p&gt;At that point, five operational issues show up consistently:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Limited provider flexibility
&lt;/h3&gt;

&lt;p&gt;n8n supports multiple providers like OpenAI, Anthropic, Gemini, Azure OpenAI, and others through credential nodes. However, switching providers still means updating workflows individually. There is no native fallback if a provider fails, and no routing across providers during rate-limit pressure.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cost and token visibility
&lt;/h3&gt;

&lt;p&gt;Execution history shows whether a workflow ran. It does not show token usage, per-node cost, or which team triggered model activity.&lt;/p&gt;

&lt;p&gt;Agentic loops can make multiple sequential model calls inside one execution. Costs accumulate without visibility. There is no per-team or per-workflow breakdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Security and compliance boundaries
&lt;/h3&gt;

&lt;p&gt;Provider API keys live directly inside credential storage. Access depends on credential permissions, not model-level isolation across teams.&lt;/p&gt;

&lt;p&gt;Prompts may include internal documents, customer records, or PII. There is no interception layer to inspect or filter what reaches the model, and no audit trail of what was sent or by whom.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cost control and optimization
&lt;/h3&gt;

&lt;p&gt;n8n does not enforce budget caps for model usage. A looping workflow can consume quota quickly without warning.&lt;/p&gt;

&lt;p&gt;There is also no routing logic to send lightweight tasks to smaller models and reserve advanced reasoning models for complex steps. Teams share provider limits without execution-level prioritization.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Access control at the model layer
&lt;/h3&gt;

&lt;p&gt;RBAC in n8n controls who can edit workflows. It does not control which models those workflows can call.&lt;/p&gt;

&lt;p&gt;Teams typically share the same provider credentials across environments. There is no hierarchy for limiting usage by team, environment, or model class.&lt;/p&gt;

&lt;p&gt;Across all five areas, the pattern is consistent. Workflow structure remains strong, but model access operates without a coordination layer. That missing layer is what n8n best practices need to address in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  n8n best practices: Add an AI gateway as the LLM control layer
&lt;/h2&gt;

&lt;p&gt;At this stage, n8n best practices shift from workflow design to control. The issue is not workflow design but the lack of a control layer between n8n and the providers that those workflows depend on.&lt;/p&gt;

&lt;p&gt;An &lt;a href="https://portkey.ai/docs/product/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;AI gateway&lt;/u&gt;&lt;/a&gt; sits between n8n’s agents and provider APIs. Every request passes through it. Routing, access control, observability, and guardrails are applied before the request reaches a model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model and provider switching
&lt;/h3&gt;

&lt;p&gt;With a gateway, model and provider selection moves out of workflows and into configuration. Providers and models can be switched centrally. Routing logic can match task complexity to model capability. Changes apply across all workflows without modification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized credential management
&lt;/h3&gt;

&lt;p&gt;A gateway stores provider credentials centrally and issues scoped access keys per team or workflow. These keys carry their own limits and permissions. Access can be rotated or revoked without touching workflow configuration.&lt;/p&gt;
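&lt;p&gt;As a rough illustration of the difference for whoever calls the model from code (the endpoint and key names below are placeholders, not an exact Portkey setup), a scoped gateway key replaces the raw provider key:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch: calling models through a gateway with a scoped key.
# Base URL and key values are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.portkey.ai/v1",   # gateway endpoint instead of the provider URL
    api_key="PORTKEY_SCOPED_KEY",           # team- or workflow-scoped key, not a raw provider key
)

response = client.chat.completions.create(
    model="gpt-4o-mini",                    # whatever the gateway config resolves this to
    messages=[{"role": "user", "content": "Summarize today's failed executions."}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;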

&lt;h3&gt;
  
  
  Budget and rate limits
&lt;/h3&gt;

&lt;p&gt;Budget limits cap total spend over time. Rate limits control request volume. Both apply before requests reach providers. Teams can enforce usage boundaries per workflow, team, or environment without relying on manual monitoring.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full observability and cost attribution
&lt;/h3&gt;

&lt;p&gt;A gateway logs every request with metadata including workflow ID, team, model, token usage, cost, and latency.&lt;/p&gt;

&lt;p&gt;Usage becomes queryable across workflows and providers. Cost attribution no longer depends on reconstructing logs after the fact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Provider fallbacks and load balancing
&lt;/h3&gt;

&lt;p&gt;A gateway introduces routing logic across providers. Requests can fall back automatically when a provider fails and distribute across providers during high-volume periods. Workflows continue running without changes.&lt;/p&gt;
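&lt;p&gt;As an illustration only (the field names are indicative of gateway-style routing configs, not an exact schema), a fallback-plus-load-balancing policy could be declared once and reused by every workflow:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative routing policy; treat field names as an assumption, not a schema.
routing_config = {
    "strategy": {"mode": "fallback"},             # try targets in order when one fails
    "targets": [
        {
            "strategy": {"mode": "loadbalance"},  # spread normal traffic across primaries
            "targets": [
                {"provider": "openai", "weight": 0.7},
                {"provider": "azure-openai", "weight": 0.3},
            ],
        },
        {"provider": "anthropic"},                # secondary provider if the primaries fail
    ],
}
&lt;/code&gt;&lt;/pre&gt;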

&lt;h3&gt;
  
  
  Guardrails on inputs and outputs
&lt;/h3&gt;

&lt;p&gt;A gateway applies runtime checks before requests reach the model and before responses return. This includes PII detection, content filtering, and prompt injection protection. Policies apply consistently across all workflows.&lt;/p&gt;
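&lt;p&gt;The gateway runs these checks for you; as a rough sketch of the idea behind an input guardrail (patterns simplified for illustration), a pre-request filter might redact obvious PII before the prompt is forwarded:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

# Simplified illustration of an input guardrail. Real detectors are far richer
# and also cover outputs, content policy, and prompt-injection patterns.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(prompt: str) -&gt; str:
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED_{label.upper()}]", prompt)
    return prompt

print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789, about the refund."))
&lt;/code&gt;&lt;/pre&gt;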

&lt;h2&gt;
  
  
  Portkey provides that layer by intercepting every LLM request before it reaches a provider.
&lt;/h2&gt;

&lt;p&gt;Once connected, routing rules, usage limits, guardrails, logging, and access policies are applied centrally rather than inside individual workflows.&lt;/p&gt;

&lt;p&gt;Workflows continue to run as they are. For platform teams, model usage becomes visible, governed, and consistent across environments.&lt;/p&gt;


&lt;p&gt;Setting this up requires a single change in how LLM calls are routed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Add an OpenAI node in your n8n workflow&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Configure credentials so the node points at the gateway instead of the provider (see the sketch after these steps)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select any model in the node. The model defined in your Portkey Config overrides the node’s default model setting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
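
&lt;p&gt;In practice the credential change usually comes down to two fields; the values below are indicative of how the Portkey integration guide describes it, so confirm the exact URL and field names against the docs:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Indicative values for the n8n OpenAI credential when routing through Portkey.
# Confirm field names and the exact gateway URL in Portkey's n8n integration docs.
n8n_openai_credential = {
    "api_key": "YOUR_PORTKEY_API_KEY",        # Portkey key, not the raw provider key
    "base_url": "https://api.portkey.ai/v1",  # gateway endpoint instead of api.openai.com
}
&lt;/code&gt;&lt;/pre&gt;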


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag9m6s03o1tn5ggcml7o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag9m6s03o1tn5ggcml7o.png" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling n8n AI workflows across engineering teams
&lt;/h2&gt;

&lt;p&gt;Once n8n LLM traffic flows through a gateway, platform teams can enforce policies that make usage predictable and manageable across teams and environments.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability area&lt;/th&gt;
&lt;th&gt;n8n (native)&lt;/th&gt;
&lt;th&gt;AI gateway (Portkey)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Credential management&lt;/td&gt;
&lt;td&gt;Provider keys stored in n8n credentials&lt;/td&gt;
&lt;td&gt;Centralized provider keys with scoped access per team/workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access control&lt;/td&gt;
&lt;td&gt;Workflow-level RBAC only&lt;/td&gt;
&lt;td&gt;Model-level access control with scoped API keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM routing&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Fallbacks, load balancing, conditional routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget controls&lt;/td&gt;
&lt;td&gt;No native limits&lt;/td&gt;
&lt;td&gt;Per-team and per-workflow budget limits and rate limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Execution history only&lt;/td&gt;
&lt;td&gt;Detailed logs with tokens, cost, latency, and metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost attribution&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Usage breakdown by workflow, team, or environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guardrails&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;PII detection, content filtering, prompt injection protection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider management&lt;/td&gt;
&lt;td&gt;Configured per workflow&lt;/td&gt;
&lt;td&gt;Centralized configuration with multi-provider routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auditability&lt;/td&gt;
&lt;td&gt;Limited visibility&lt;/td&gt;
&lt;td&gt;Complete request and response tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What production-ready n8n AI workflows require next
&lt;/h2&gt;

&lt;p&gt;Start by routing a small set of workflows or a single team through the gateway. Apply budget limits, enable request logging, and define fallback providers early.&lt;/p&gt;

&lt;p&gt;With that layer in place, usage can expand across teams and workflows without losing control over cost, access, or reliability.&lt;/p&gt;

&lt;p&gt;As AI workflows grow more complex, with longer execution paths and higher token usage, this layer becomes necessary to keep operations stable.&lt;/p&gt;

&lt;p&gt;Refer to Portkey’s &lt;a href="https://portkey.ai/docs/integrations/libraries/n8n?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;n8n integration documentation&lt;/u&gt;&lt;/a&gt; or &lt;a href="https://portkey.ai/book-a-demo?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;book a demo&lt;/u&gt;&lt;/a&gt; to see how this fits into your existing setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Does connecting n8n to Portkey require changing my existing workflows?
&lt;/h3&gt;

&lt;p&gt;No. Only the OpenAI node credentials change. Workflow logic, triggers, routing, and execution structure remain exactly the same after connecting Portkey as the gateway layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which LLM providers can I use with n8n through Portkey?
&lt;/h3&gt;

&lt;p&gt;You can use OpenAI, Anthropic, Gemini, Azure OpenAI, AWS Bedrock, Vertex AI, and others through one gateway endpoint without modifying workflows when switching providers or models.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I attribute LLM costs to specific n8n workflows or teams?
&lt;/h3&gt;

&lt;p&gt;Portkey metadata tags attach each request to a workflow, team, or environment. Token usage, latency, and cost become searchable across providers from one centralized dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens to my n8n workflows if the primary LLM provider goes down?
&lt;/h3&gt;

&lt;p&gt;Fallback targets defined in Portkey Configs automatically reroute requests to another provider. Workflows continue running without credential updates or execution changes inside n8n.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I set different model access rules for different teams using n8n?
&lt;/h3&gt;

&lt;p&gt;Yes. Scoped Portkey &lt;a href="https://portkey.ai/blog/secret-references-ai-api-key-management/" rel="noopener noreferrer"&gt;&lt;u&gt;API keys&lt;/u&gt;&lt;/a&gt; allow separate model access, rate limits, and budget controls per team without exposing provider credentials or modifying existing workflow logic.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>OpenAI Codex best practices</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Mon, 20 Apr 2026 21:26:50 +0000</pubDate>
      <link>https://dev.to/portkey/openai-codex-best-practices-536l</link>
      <guid>https://dev.to/portkey/openai-codex-best-practices-536l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneu3r59p758dz4ruohky.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fneu3r59p758dz4ruohky.png" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Codex works well for one developer. The moment it scales to an enterprise team, the operational gaps show up fast: no visibility into who's spending what, API keys scattered across machines with no clean revocation path, no isolation between teams sharing the same provider capacity, and no fallback when OpenAI's API goes down.&lt;/p&gt;

&lt;p&gt;None of these problems is visible in individual developer workflows. All of them surface fast in org-wide deployments.&lt;/p&gt;

&lt;p&gt;This guide focuses on what changes at that point: the operational controls required to run Codex safely and predictably across teams.&lt;/p&gt;

&lt;h2&gt;
  
  
  How teams access OpenAI Codex today
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://openai.com/codex/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;OpenAI Codex&lt;/u&gt;&lt;/a&gt; is a &lt;strong&gt;coding AI Agent&lt;/strong&gt; that teams can access through two modes currently:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Access mode&lt;/th&gt;
&lt;th&gt;How it works&lt;/th&gt;
&lt;th&gt;What it means operationally&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Subscription (Plus, Pro, Business, Edu, Enterprise)&lt;/td&gt;
&lt;td&gt;Usage included in plan; shared credit pool across the workspace&lt;/td&gt;
&lt;td&gt;No per-developer breakdown; Business plans can purchase additional credits; Enterprise provides a shared credit pool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API (Responses API)&lt;/td&gt;
&lt;td&gt;Pay-per-token; usage tracked at the API level&lt;/td&gt;
&lt;td&gt;Programmatic control, but every developer holds a raw provider key&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Subscription mode ties usage to ChatGPT account management with limited attribution. API mode gives programmatic control but introduces key sprawl at the team scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where OpenAI Codex best practices break down at scale
&lt;/h2&gt;

&lt;p&gt;The moment Codex moves from one developer to many, the gaps that OpenAI Codex best practices are meant to address start showing up across every team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost and token visibility become difficult to interpret.&lt;/strong&gt; A single Codex task can trigger dozens of underlying model calls, making it unclear how usage maps back to a developer, team, or workflow. Enterprise plans provide totals, not per-team breakdowns. At month end, there is no clear answer to “which team spent what.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Credential sprawl introduces uncertainty.&lt;/strong&gt; Provider keys are spread across local files, shell environments, and internal channels, with no clear ownership or lifecycle. Over time, it becomes unclear who has access, where keys are stored, and what they are being used for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability is tied to a single provider.&lt;/strong&gt; When requests fail or degrade, workflows stall with no alternative path. Teams that need to route through different providers for compliance or cost reasons have no consistent way to do so.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access control and team isolation become difficult to enforce.&lt;/strong&gt; Usage from one team can silently consume shared capacity allocated to another, with no clear boundary between workloads. Access is not scoped at the team or project level, and over time it becomes unclear who is using Codex, for what purpose, and at what cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  OpenAI Codex best practices: The five operational controls every team needs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qcs7ktfjcs6dug8t8yq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qcs7ktfjcs6dug8t8yq.png" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using Portkey's &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt;, build a centralized layer to control and monitor providers, users, and API keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Manage credentials centrally
&lt;/h3&gt;

&lt;p&gt;Store provider API keys centrally instead of distributing them across developer environments. Developers should never handle raw credentials. Issue scoped API keys per team or project with defined permissions, budgets, and rate limits. This allows access to be revoked or rotated instantly without touching individual machines. The same setup should support multiple providers and models without requiring per-developer configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Connect to multiple providers at once
&lt;/h3&gt;

&lt;p&gt;Portkey lets your teams switch between providers of their choice. Admin teams can centrally control access to providers and models, while adding budgets and rate limits for each.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Get visibility into costs and usage
&lt;/h3&gt;

&lt;p&gt;With requests flowing via Portkey, you can see detailed costs and usage, down to user level. You can also add metadata to each request. The same logs are critical for debugging multi-step agent workflows where failures need to be traced across multiple requests.&lt;/p&gt;
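&lt;p&gt;As a hedged sketch of what that attribution looks like from code (the metadata header follows Portkey’s documented convention, but verify the exact shape in the docs; all values here are examples):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
from openai import OpenAI

client = OpenAI(
    base_url="https://api.portkey.ai/v1",  # gateway endpoint
    api_key="PORTKEY_SCOPED_KEY",          # scoped key issued to this team
)

# Tags that let cost and token usage be filtered per team, project, or developer.
metadata = {"team": "payments", "project": "checkout-agent", "developer": "alice"}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Refactor this function for readability."}],
    extra_headers={"x-portkey-metadata": json.dumps(metadata)},
)
&lt;/code&gt;&lt;/pre&gt;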

&lt;h3&gt;
  
  
  4. Apply guardrails to inputs and outputs
&lt;/h3&gt;

&lt;p&gt;Codex runs with developer-level access, so sensitive data can enter prompts without inspection. Apply validation to both inputs and outputs, including PII detection, content filtering, prompt injection protection, and token limits. Guardrails should be enforced at the gateway layer so they apply consistently across all developers without requiring local configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Enforce RBAC and maintain a full audit trail
&lt;/h3&gt;

&lt;p&gt;Access should map to organizational structure. Use scoped API keys to define which models, providers, and environments each team or developer can access. Maintain a hierarchy across org, team, and developer levels. This ensures teams operate independently while preserving clear boundaries and enabling centralized governance and traceability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to implement these controls with Portkey
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gvt0jzqhxkl2gj9iika.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gvt0jzqhxkl2gj9iika.png" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an AI gateway, &lt;a href="http://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Portkey&lt;/u&gt;&lt;/a&gt; sits between Codex and providers like OpenAI, Anthropic, AWS Bedrock, and Vertex AI, centralizing model access into a single system.&lt;/p&gt;

&lt;p&gt;Cost, tokens, and latency are tracked per request. Budgets and rate limits are enforced at the team or project level. Access is scoped through API keys, and traffic can be routed across providers with fallbacks and retries configured centrally.&lt;/p&gt;

&lt;p&gt;Developers continue using Codex as usual. For platform teams, model access becomes centralized, observable, and controlled.&lt;/p&gt;

&lt;p&gt;To &lt;a href="https://portkey.ai/docs/integrations/libraries/codex?ref=portkey.ai" rel="noopener noreferrer"&gt;integrate Codex with Portkey&lt;/a&gt;, read the complete documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Codex as infrastructure
&lt;/h2&gt;

&lt;p&gt;If you are running Codex across multiple teams, start by routing one team through a gateway, setting budget limits, and enabling request logging and fallback routing.&lt;/p&gt;

&lt;p&gt;Once this layer is in place, expand usage across teams without losing visibility, access control, or cost predictability.&lt;/p&gt;

&lt;p&gt;Refer to the &lt;a href="https://portkey.ai/docs/integrations/libraries/codex?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;OpenAI Codex integration docs&lt;/u&gt;&lt;/a&gt; or book a &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;personalized demo&lt;/u&gt;&lt;/a&gt; for a walkthrough on enterprise deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What breaks first when Codex scales across teams?
&lt;/h3&gt;

&lt;p&gt;Cost visibility and access control. Usage cannot be attributed after the fact, and API keys are distributed without scope or central revocation.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do teams prevent one session from exhausting shared credits?
&lt;/h3&gt;

&lt;p&gt;Set per-developer and per-team budget and rate limits at the gateway layer before issuing access. Isolate teams into separate workspaces so one group's usage cannot consume another's capacity.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I get per-team cost attribution for Codex usage?
&lt;/h3&gt;

&lt;p&gt;Tag every request with team, project, and developer metadata at the gateway layer. Filter by tag in the &lt;a href="https://portkey.ai/blog/buyers-guide-to-llm-observability-tools/" rel="noopener noreferrer"&gt;&lt;u&gt;observability&lt;/u&gt;&lt;/a&gt; dashboard. Without metadata tagging, attribution requires manual log analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens when the primary provider fails?
&lt;/h3&gt;

&lt;p&gt;Without a configured fallback, all in-flight sessions and new requests fail immediately with no native retry or reroute. With automatic failover at the gateway layer, requests route to an alternative provider without developer intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I route OpenAI Codex through AWS Bedrock or Google Vertex AI?
&lt;/h3&gt;

&lt;p&gt;Yes. Route Codex through a gateway that supports multi-provider configs. Switch providers by updating the provider slug in config.toml with no changes to individual developer setups.&lt;/p&gt;

</description>
      <category>aiagents</category>
    </item>
    <item>
      <title>Semantic caching thresholds and why they matter</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Sat, 18 Apr 2026 22:41:45 +0000</pubDate>
      <link>https://dev.to/portkey/semantic-caching-thresholds-and-why-they-matter-4ab3</link>
      <guid>https://dev.to/portkey/semantic-caching-thresholds-and-why-they-matter-4ab3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllq82solemj7bl3bvcgi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fllq82solemj7bl3bvcgi.jpg" alt="Semantic caching thresholds and why they matter" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM apps waste a surprising amount of time and money answering the same question over and over again because users ask it in slightly different ways. A basic cache does not help here. It only works when the input text is identical. In real usage, that almost never happens.&lt;/p&gt;

&lt;p&gt;Semantic caching fixes this by matching meaning instead of wording. It turns each query into a vector and checks whether a similar intent has already been answered. If it has, the system returns the cached response instead of calling the model again.&lt;/p&gt;

&lt;p&gt;🔎 AWS found that semantic caching reduced cost by &lt;a href="https://aws.amazon.com/blogs/database/lower-cost-and-latency-for-ai-using-amazon-elasticache-as-a-semantic-cache-with-amazon-bedrock/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;up to 86%&lt;/u&gt;&lt;/a&gt; and improved latency by 88%, with cache hits returning in milliseconds instead of seconds.&lt;/p&gt;

&lt;p&gt;However, the similarity threshold controls whether any of this works. It defines how close two queries need to be before the system reuses a response. If that threshold is too strict, most requests miss the cache. If it is too loose, the system starts returning answers that do not quite fit.&lt;/p&gt;

&lt;p&gt;But don’t worry, this post’s got you covered. It’ll explain how semantic caching works, how to set the right threshold, and how to handle the real production issues that follow, including stale data, multi-turn context, and monitoring cache performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How semantic caching works&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before getting into embeddings, it’s important to separate three caching approaches that are often conflated but solve very different problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Where it operates&lt;/th&gt;
&lt;th&gt;What it matches&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Exact-match caching&lt;/td&gt;
&lt;td&gt;Application layer, string key lookup.&lt;/td&gt;
&lt;td&gt;Identical input strings only.&lt;/td&gt;
&lt;td&gt;Near-zero hit rate on natural language. Different phrasing always results in a miss.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt / prefix caching&lt;/td&gt;
&lt;td&gt;Provider-side (&lt;a href="https://openai.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://www.anthropic.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, etc.).&lt;/td&gt;
&lt;td&gt;Identical prompt prefixes (shared system prompts).&lt;/td&gt;
&lt;td&gt;No understanding of meaning. Only helps when large parts of the prompt are identical.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic caching&lt;/td&gt;
&lt;td&gt;Application or gateway layer.&lt;/td&gt;
&lt;td&gt;Queries with similar meaning regardless of phrasing.&lt;/td&gt;
&lt;td&gt;Requires embeddings + vector store. Accuracy depends on the threshold and the embedding quality.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three approaches operate at different layers and can be used together.&lt;/p&gt;

&lt;p&gt;Exact-match caching is the simplest form. The system stores a response and only reuses it if the exact same input appears again. That works well for structured data, but breaks down completely with natural language, where users rarely phrase things the same way twice.&lt;/p&gt;

&lt;p&gt;Prompt caching is a provider feature you enable through an API setting. It reduces the cost of reprocessing shared context, like system prompts or long instructions. This improves efficiency, but it does not help with reusing answers across different user queries. &lt;/p&gt;

&lt;p&gt;Meanwhile, semantic caching is something you build or adopt at the application or gateway layer. It prevents redundant model calls by reusing responses for queries that mean the same thing.&lt;/p&gt;

&lt;p&gt;For example, &lt;em&gt;“What’s your return policy?”&lt;/em&gt;, &lt;em&gt;“How do I return something?”&lt;/em&gt;, and &lt;em&gt;“Can I send this back?”&lt;/em&gt; would all miss an exact-match cache because the wording is different. With semantic caching, these questions map to the same intent and reuse a single stored response.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;From query to cache hit in four steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxv8io27j5u2uvoe4zi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxv8io27j5u2uvoe4zi8.png" alt="Semantic caching thresholds and why they matter" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand how this works in practice, it helps to break down what happens at runtime:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Convert the query into an embedding:&lt;/strong&gt; Every incoming query is first passed through an embedding model, which transforms the text into a numerical vector that represents its meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search for similar past queries:&lt;/strong&gt; The system queries a vector store (such as &lt;a href="https://redis.io/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Redis&lt;/u&gt;&lt;/a&gt;, &lt;a href="https://www.pinecone.io/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Pinecone&lt;/u&gt;&lt;/a&gt;, &lt;a href="https://github.com/pgvector/pgvector?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;pgvector&lt;/u&gt;&lt;/a&gt;, etc.) to find embeddings that are close in meaning. This is done using a similarity metric like cosine similarity, for example, which has a mathematical range of −1 to 1. A score of 1 means identical direction, 0 means orthogonal, and −1 means diametrically opposed. In practice, text embeddings tend to produce positive scores, but this is model-dependent behavior, not a mathematical guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide whether it is a cache hit:&lt;/strong&gt; The system compares the best match against a configured similarity threshold. If the score exceeds that threshold, it is treated as a cache hit, and the stored response is returned immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle the cache miss:&lt;/strong&gt; If no match meets the threshold, the request is sent to the LLM. Latency at this step varies widely by model, prompt length, and whether you measure time-to-first-token or full response time. The response is then stored along with its embedding so it can be reused in future requests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This setup introduces two components that traditional caches do not require: an embedding model to generate vectors and a vector store capable of efficient similarity search.&lt;/p&gt;
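
&lt;p&gt;To make the flow concrete, here is a compact sketch of the hit-or-miss decision. The toy character-frequency embedding stands in for a real embedding model, and the threshold value is only an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# Toy embedding: character-frequency vector. A production system would call an embedding model.
def embed(text: str) -&gt; list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -&gt; float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

THRESHOLD = 0.90                 # example cutoff; tuning is covered in the next section
cache = []                       # list of (embedding, response) pairs; a vector store in production

def call_llm(query: str) -&gt; str:
    return f"(model answer for: {query})"   # placeholder for a real LLM call

def lookup(query: str) -&gt; str:
    q = embed(query)
    score, cached = max(((cosine(q, e), resp) for e, resp in cache), default=(0.0, None))
    if cached is not None and score &gt;= THRESHOLD:
        return cached            # cache hit: reuse the stored answer
    answer = call_llm(query)     # cache miss: fall through to the model and store the result
    cache.append((q, answer))
    return answer

print(lookup("What is your return policy?"))   # miss: calls the model, stores the answer
print(lookup("How do I return something?"))    # may be a hit, depending on the toy embedding
&lt;/code&gt;&lt;/pre&gt;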

&lt;h2&gt;
  
  
  &lt;strong&gt;Choosing the right similarity threshold for your use case&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most teams assume a stricter similarity threshold is safer, but the data shows that that’s not always true.&lt;/p&gt;

&lt;p&gt;AWS tested multiple thresholds on real chatbot queries using Claude 3 Haiku and Titan Embeddings. The results highlight how much performance depends on where you set the cutoff.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Threshold&lt;/th&gt;
&lt;th&gt;Hit rate&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cost savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.99 (strict)&lt;/td&gt;
&lt;td&gt;23.5%&lt;/td&gt;
&lt;td&gt;92.1%&lt;/td&gt;
&lt;td&gt;15.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;56.0%&lt;/td&gt;
&lt;td&gt;92.6%&lt;/td&gt;
&lt;td&gt;51.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.90&lt;/td&gt;
&lt;td&gt;74.5%&lt;/td&gt;
&lt;td&gt;92.3%&lt;/td&gt;
&lt;td&gt;72.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.80&lt;/td&gt;
&lt;td&gt;87.6%&lt;/td&gt;
&lt;td&gt;91.8%&lt;/td&gt;
&lt;td&gt;84.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.75 (permissive)&lt;/td&gt;
&lt;td&gt;90.3%&lt;/td&gt;
&lt;td&gt;91.2%&lt;/td&gt;
&lt;td&gt;86.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Moving from a strict threshold of 0.99 to a more permissive 0.75 barely changes accuracy, with less than a one percentage point difference. At the same time, cost savings increase by roughly 70 percentage points. This means you can allow significantly more cache reuse without meaningfully degrading quality for general chatbot scenarios.&lt;/p&gt;

&lt;p&gt;But this only holds when “similar” queries truly lead to the same answer.&lt;/p&gt;

&lt;p&gt;In domain-specific systems, that assumption breaks. In medical or legal use cases, small wording differences can change the meaning entirely. A cached response that is slightly off can be harmful.&lt;/p&gt;

&lt;p&gt;The same issue shows up in technical documentation. Queries like “How to fix Error A” and “How to fix Error B” may look similar in embeddings, but they require completely different fixes.&lt;/p&gt;

&lt;p&gt;So the real risk is how much your domain can tolerate small mismatches in meaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to choose your tuning threshold&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A practical way to approach this is to treat threshold selection as an evaluation problem:&lt;/p&gt;

&lt;p&gt;Start with a threshold between 0.90 and 0.95. This range gives you a strong baseline where reuse is meaningful without introducing obvious errors. From there, build a validation set that reflects your real usage. Include two types of pairs: queries that clearly express the same intent, and queries that look similar but should produce different answers.&lt;/p&gt;

&lt;p&gt;Run these through your cache and lower the threshold gradually. At each step, track the false positive rate, which is when the system reuses a response it should not. This is the metric that matters most.&lt;/p&gt;

&lt;p&gt;Once false positives start to exceed roughly 3% to 5%, you have reached the limit of your embedding model. At that point, adjusting the threshold further will not help. The model itself cannot reliably separate “same intent” from “similar but different.” Fixing that requires improving or replacing the embedding model.&lt;/p&gt;
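
&lt;p&gt;One way to run that sweep, sketched with invented similarity scores; in a real evaluation each score would come from your embedding model over a labeled pair drawn from your own traffic:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Labeled pairs: (similarity score between the two queries, should they share an answer?)
# The numbers are invented for illustration; replace them with pairs from real usage.
validation_pairs = [
    (0.97, True), (0.93, True), (0.91, True), (0.88, True),
    (0.92, False), (0.86, False), (0.81, False), (0.78, False),
]

def evaluate(threshold: float):
    hits = [(score, same) for score, same in validation_pairs if score &gt;= threshold]
    false_positives = sum(1 for _, same in hits if not same)
    hit_rate = len(hits) / len(validation_pairs)
    fp_rate = false_positives / len(hits) if hits else 0.0
    return hit_rate, fp_rate

for threshold in (0.95, 0.92, 0.90, 0.85, 0.80):
    hit_rate, fp_rate = evaluate(threshold)
    print(f"threshold={threshold:.2f}  hit_rate={hit_rate:.0%}  false_positive_rate={fp_rate:.0%}")
    if fp_rate &gt; 0.05:           # past roughly 5% false positives, stop loosening the threshold
        break
&lt;/code&gt;&lt;/pre&gt;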

&lt;p&gt;Portkey takes a more opinionated production approach to threshold tuning. Instead of relying on arbitrary values, it recommends starting around 0.95 similarity and validating performance through backtesting on real traffic. Based on learnings from &lt;a href="https://portkey.ai/blog/implementing-frugalgpt-smarter-llm-usage-for-lower-costs/" rel="noopener noreferrer"&gt;&lt;u&gt;over 250 million&lt;/u&gt;&lt;/a&gt; cache requests, teams are advised to test on roughly 5,000 queries and adjust until accuracy consistently stays above 99%. In practice, Portkey applies a high-confidence threshold before returning cache hits, with tests reporting around &lt;a href="https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/" rel="noopener noreferrer"&gt;&lt;u&gt;99% user-rated accuracy&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This differs from a fully DIY setup. Portkey does not expose user-configurable thresholds. Similarity confidence is managed internally rather than exposed as a tuning knob. Teams that want to experiment across a wider range of thresholds, like the 0.5 to 0.99 range used in benchmarks, will need a self-managed setup.&lt;/p&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;TTL, invalidation, and keeping cached responses fresh&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A semantic cache can return a correct match, but an outdated answer. That’s why you need to manage when cached responses expire or are refreshed using the mechanisms below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Time-based expiration (TTL)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;TTL defines how long a cached response is considered valid before it expires automatically. The right TTL depends entirely on how often the underlying data changes.&lt;/p&gt;

&lt;p&gt;Short TTLs reduce the risk of stale answers but lower your cache hit rate. Longer TTLs improve reuse but increase the chance of serving outdated information.&lt;/p&gt;

&lt;p&gt;A critical best practice is adding random jitter to TTL values. Without it, many entries expire simultaneously, causing a spike of cache misses and LLM calls (the “thundering herd” problem). Jitter spreads expirations over time and stabilizes system load.&lt;/p&gt;
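
&lt;p&gt;A minimal sketch of jittered expiry (store-specific APIs vary; the constants and the Redis-style call in the comment are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import random

BASE_TTL_SECONDS = 3600          # nominal freshness window for this class of data
JITTER_FRACTION = 0.10           # spread expirations by +/- 10%

def ttl_with_jitter() -&gt; int:
    jitter = BASE_TTL_SECONDS * JITTER_FRACTION
    return int(BASE_TTL_SECONDS + random.uniform(-jitter, jitter))

# e.g. cache.set(key, value, ex=ttl_with_jitter()) with a Redis-style client
print(ttl_with_jitter())
&lt;/code&gt;&lt;/pre&gt;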
&lt;h3&gt;
  
  
  &lt;strong&gt;Event-triggered invalidation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Unlike TTL, which expires entries passively over time, event-triggered invalidation actively removes cache entries in response to specific changes in underlying data or systems. This ensures the cache reflects real-world updates as they happen.&lt;/p&gt;

&lt;p&gt;For example, when a pricing update occurs, all related cached responses should be immediately cleared rather than waiting for TTL expiry. This approach prevents serving outdated answers within the TTL window and is essential for highly dynamic data.&lt;/p&gt;

&lt;p&gt;💡 In production systems like Portkey, cache behavior is &lt;a href="https://portkey.ai/docs/product/ai-gateway/cache-simple-and-semantic?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;controlled through TTL settings&lt;/u&gt;&lt;/a&gt;, and requests can also use Force Refresh to bypass an existing cached response.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Why multi-turn conversations break naive caching&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Semantic caching works best when queries are self-contained, but multi-turn conversations and RAG pipelines rarely behave that way. In these settings, meaning is distributed across prior messages, so caching each query in isolation can produce answers that are technically correct yet contextually wrong.&lt;/p&gt;

&lt;p&gt;Consider one conversation where the system caches the query “What is the largest lake in North America?” Later, in a separate discussion about stadiums, a user asks, “What is the second largest?” A naive semantic cache may match this to the earlier geography query and return “Lake Huron.” The retrieval is semantically similar at the surface level, but completely wrong in intent because the conversational context is missing.&lt;/p&gt;

&lt;p&gt;There are two main ways to address this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Context-aware embedding&lt;/strong&gt;, where the system embeds not just the latest user message but also the relevant conversational history retrieved from memory. This ensures the embedding reflects actual intent rather than isolated phrasing, which is why AWS recommends applying semantic caching to the combined query and context. The trade-off is that longer inputs generate more unique embeddings, reducing cache hit rates. As conversations grow, caching becomes less effective and may need to be bypassed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query rewriting&lt;/strong&gt;, where a lightweight model rewrites follow-up questions into standalone queries before cache lookup. For example, “What is the second largest?” becomes “What is the second largest stadium in the US?” This avoids embedding long context while preserving intent, improving reuse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RAG pipelines face the same issue because retrieved documents effectively modify the query. Caching only the raw question ignores this augmentation and risks mismatches.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Monitoring a semantic cache in production&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The hardest problem with semantic caching is that failures are invisible by default. Unlike typical system errors, nothing breaks. A cache miss simply falls through to the LLM and returns a normal response. More dangerously, a bad cache hit returns an incorrect answer with full confidence and a 200 OK status. Without explicit instrumentation, you cannot tell whether the system is working or silently degrading. Even performance signals can mislead: cache hits are significantly faster than misses, but this latency gap is only visible if you track hits and misses separately.&lt;/p&gt;

&lt;p&gt;To make these issues observable, you need cache-specific metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The cache hit ratio&lt;/strong&gt; provides a baseline efficiency signal, but it is not sufficient on its own.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The distribution of similarity scores&lt;/strong&gt; shows whether matches are strong or barely clear the threshold.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The latency differential&lt;/strong&gt; between hits and misses helps confirm that the cache is actually delivering performance gains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, you need a way to estimate false positives, typically through sampling and evaluation using humans or LLM-based judges.&lt;/p&gt;

&lt;p&gt;High hit rates can also hide serious problems. If 90% of requests are fast cache hits and 10% are slow or failing LLM calls, average latency looks healthy while real user experience suffers. To avoid this, monitor P99 latency for cache misses separately so fallback-path issues are not masked.&lt;/p&gt;
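
&lt;p&gt;A hedged sketch of the minimum instrumentation, tracking hits and misses separately so the latency gap and the miss-path tail stay visible (metric names and structure are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import statistics

# Illustrative in-process counters; a real system would export these to its monitoring stack.
hit_latencies_ms, miss_latencies_ms, hit_scores = [], [], []

def record(cache_hit: bool, latency_ms: float, similarity: float = 0.0):
    if cache_hit:
        hit_latencies_ms.append(latency_ms)
        hit_scores.append(similarity)
    else:
        miss_latencies_ms.append(latency_ms)

def report():
    total = len(hit_latencies_ms) + len(miss_latencies_ms)
    print(f"hit ratio: {len(hit_latencies_ms) / total:.0%}")
    print(f"median similarity on hits: {statistics.median(hit_scores):.3f}")
    print(f"median hit latency: {statistics.median(hit_latencies_ms):.0f} ms")
    # P99 of the miss path, tracked on its own so slow fallbacks are not hidden by fast hits
    p99_miss = statistics.quantiles(miss_latencies_ms, n=100)[98]
    print(f"p99 miss latency: {p99_miss:.0f} ms")
&lt;/code&gt;&lt;/pre&gt;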

&lt;p&gt;In some cases, the safest approach is to bypass caching entirely. This includes real-time queries such as inventory or pricing, safety-critical outputs, and multi-step agent workflows where one incorrect cached step can corrupt everything downstream.&lt;/p&gt;

&lt;p&gt;This is where platforms like Portkey &lt;a href="https://portkey.ai/docs/product/ai-gateway/cache-simple-and-semantic?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;go beyond basic caching&lt;/u&gt;&lt;/a&gt; by providing per-request cache status in logs and response metadata. The Logs UI surfaces statuses such as cache hit or miss, while the analytics dashboard provides visibility into hit rate, latency savings, and cost savings, making cache performance measurable and easy to monitor.&lt;/p&gt;

&lt;p&gt;⭐ One large food delivery platform handling tens of millions of AI requests used Portkey’s caching, routing, and fallbacks to &lt;a href="https://portkey.ai/case-studies/leading-delivery-platform?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;cut LLM spend by over $500K&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Putting semantic caching to work&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Semantic caching is a layer you add to your LLM stack, and there are multiple ways to implement it depending on how much control you want.&lt;/p&gt;

&lt;p&gt;Implementation options span a broad tooling landscape, with each option balancing control and operational overhead. Open-source libraries like GPTCache offer flexibility for teams that want full control over embeddings and storage. Cloud-managed options such as Amazon ElastiCache and Azure Cosmos DB provide scalable, production-ready infrastructure with less setup. Dedicated vector-backed caches built on Redis are also common.&lt;/p&gt;

&lt;p&gt;For teams prioritizing speed and observability, AI gateways like Portkey offer managed semantic caching with built-in analytics, cross-provider support, and minimal integration overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Explore Portkey’s semantic caching&lt;/u&gt;&lt;/a&gt; to see how you can reduce cost, control quality, and ship faster without managing the infrastructure yourself!&lt;/p&gt;

&lt;h3&gt;
  &lt;a name="ready-to-get-started" href="#ready-to-get-started" class="anchor"&gt;
  &lt;/a&gt;
  Ready to get started?
&lt;/h3&gt;

&lt;p&gt;Create your account and start building in minutes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.portkey.ai/signup?ref=portkey.ai" rel="noopener noreferrer"&gt;Get Started&lt;/a&gt; &lt;a href="https://portkey.ai/book-a-demo?ref=portkey.ai" rel="noopener noreferrer"&gt;Book a Demo&lt;/a&gt;&lt;/p&gt;


</description>
      <category>ai</category>
      <category>llm</category>
      <category>nlp</category>
      <category>performance</category>
    </item>
    <item>
      <title>How to choose the right AIOps platform</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Fri, 17 Apr 2026 22:09:01 +0000</pubDate>
      <link>https://dev.to/portkey/how-to-choose-the-right-aiops-platform-27lk</link>
      <guid>https://dev.to/portkey/how-to-choose-the-right-aiops-platform-27lk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn7yr7vue9n3rlnqa76f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn7yr7vue9n3rlnqa76f.png" alt="How to choose the right AIOps platform" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your AI agent has been routing requests incorrectly, but your dashboards still show green because the infrastructure is healthy.&lt;/p&gt;

&lt;p&gt;Traditional ops were built for infrastructure systems. They do not account for workloads where correctness depends on prompts, models, and multi-step reasoning. This guide outlines what enterprises should expect from an AIOps platform built for generative AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;AI Agents and LLM apps are now production-critical systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Enterprises are embedding LLMs into revenue-generating, customer-facing, and internal decision workflows. An agent that routes customer requests, summarizes contracts, or generates compliance documents is now a system of record.&lt;/p&gt;

&lt;p&gt;However, only &lt;a href="https://www.gartner.com/en/articles/hype-cycle-for-artificial-intelligence?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;43% of organizations&lt;/u&gt;&lt;/a&gt; say their data is ready for AI, and fewer than 30% of AI leaders think their CEOs are satisfied with GenAI results.&lt;/p&gt;

&lt;p&gt;Failures in LLM systems behave differently from infrastructure outages. A server is either up or down; it is binary. An LLM failure can be invisible: wrong outputs that look correct, missed guardrails that go undetected, or runaway token costs that only appear on next month's bill.&lt;/p&gt;

&lt;p&gt;The blast radius is qualitatively different. One hallucinated response from a customer support agent can trigger dozens of escalation tickets before anyone notices the pattern.&lt;/p&gt;

&lt;p&gt;Teams are discovering that server-uptime thinking does not apply to LLM systems. Their correctness is probabilistic, and their behavior shifts with model and prompt changes. Without purpose-built operational tooling, enterprises are flying blind on their most strategically important systems.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Five capabilities to evaluate in an AIOps platform&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;When evaluating an AIOps platform for LLM and AI agent workloads, these five capabilities should be considered.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. Unified LLM observability&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Full request-level visibility across every model call, with distributed tracing for multi-step agent flows and analytics for latency, cost, and output behavior. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Portkey's observability layer is OTEL-compliant and captures the full request lifecycle: prompts, responses, token usage, latency, and metadata, all searchable and traceable across multi-step agent flows. This is backed by analysis of over 2 trillion production tokens processed across 3000+ models and 90+ regions.&lt;/p&gt;
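
&lt;p&gt;To make the idea concrete, here is a minimal sketch of the kind of request-level record such a layer captures around every call; call_model is a stand-in for any provider SDK, and the field names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time, uuid

def traced_llm_call(call_model, prompt, metadata):
    # call_model is a placeholder for any provider SDK call; the point is
    # the record captured around it: prompt, output, tokens, latency, metadata.
    record = {
        "request_id": str(uuid.uuid4()),
        "prompt": prompt,
        "metadata": metadata,          # e.g. team, workflow, environment
    }
    start = time.monotonic()
    try:
        response = call_model(prompt)
        record["output"] = response["text"]
        record["tokens"] = response["usage"]
        record["status"] = "ok"
        return response
    except Exception as exc:
        record["status"] = f"error: {exc}"
        raise
    finally:
        record["latency_ms"] = round((time.monotonic() - start) * 1000)
        print(record)  # stand-in for shipping to a log or trace backend

# Example with a fake provider call so the record shape is visible.
fake = lambda p: {"text": "ok", "usage": {"prompt_tokens": 12, "completion_tokens": 1}}
traced_llm_call(fake, "ping", {"team": "support", "workflow": "triage"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;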

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Routing and resilience&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Fallbacks to secondary models when a primary provider fails. &lt;/li&gt;
&lt;li&gt;Load balancing across providers. &lt;/li&gt;
&lt;li&gt;Conditional routing based on request type, cost target, or latency requirements. &lt;/li&gt;
&lt;li&gt;Caching to reduce redundant model calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Portkey's &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; helps you set up load balancing, fallbacks, and conditional routing to keep your apps reliable, even at production scale.&lt;/p&gt;
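
&lt;p&gt;For intuition, here is the fallback idea expressed as a minimal in-app sketch; a gateway applies the same logic as declarative configuration instead of code, and the provider functions below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal in-app sketch of fallback routing. The provider functions are
# placeholders; a gateway expresses the same ordering as policy, not code.
def call_openai(prompt):
    raise TimeoutError("primary provider timed out")   # simulate an outage

def call_anthropic(prompt):
    return {"text": "fallback answer for: " + prompt}

def complete_with_fallback(prompt, providers):
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)                   # first success wins
        except Exception as exc:                        # rate limit, timeout, outage
            errors.append((name, str(exc)))
    raise RuntimeError("all providers failed: " + str(errors))

# Providers are tried in order; the second only runs because the first fails.
provider, result = complete_with_fallback(
    "Summarize this contract clause.",
    providers=[("openai", call_openai), ("anthropic", call_anthropic)],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The value of doing this at the gateway is that the ordering, retries, and load balancing become policy you change centrally instead of code you redeploy.&lt;/p&gt;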

&lt;h2&gt;
  
  
  &lt;strong&gt;3. Guardrails and safety controls&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time enforcement of content and safety policies on both inputs and outputs. &lt;/li&gt;
&lt;li&gt;Protection against prompt injection attacks.&lt;/li&gt;
&lt;li&gt;The ability to route or reject requests that fail checks before they reach users, not just log them after.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Portkey runs 50+ guardrail checks, including both rule-based and model-based checks, helping you stay compliant.&lt;/p&gt;
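
&lt;p&gt;As a minimal illustration of a single rule-based input check (real deployments layer many such checks, plus model-based ones):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

# One rule-based input guardrail: block or redact obvious email addresses
# before the prompt ever reaches a provider. Real deployments layer many
# such checks (PII, injection patterns, topic policies) plus model-based ones.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def enforce_input_guardrail(prompt, mode="redact"):
    if not EMAIL.search(prompt):
        return prompt                      # clean: forward unchanged
    if mode == "reject":
        raise ValueError("prompt rejected: contains an email address")
    return EMAIL.sub("[REDACTED_EMAIL]", prompt)

print(enforce_input_guardrail("Email jane.doe@example.com the refund summary"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;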

&lt;h2&gt;
  
  
  &lt;strong&gt;4. Prompt management and versioning&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Collaborative prompt template creation with version control. &lt;/li&gt;
&lt;li&gt;Deploy prompt changes without code releases. &lt;/li&gt;
&lt;li&gt;Experimentation across models using the same prompt template.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Portkey's prompt studio lets teams create, version, and deploy prompt templates without code changes, and run experiments across models using the same template. Prompt changes can be promoted across environments independently of application deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. Governance and cost controls&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;RBAC to control which teams and applications can access which models. &lt;/li&gt;
&lt;li&gt;Per-workspace and per-user budget limits. &lt;/li&gt;
&lt;li&gt;Audit logs that tie every LLM output to the team, prompt, and model that produced it. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Portkey manages over $180 million in annualized AI spend across its customer base, with per-team cost attribution, virtual key management, role-based access control, and full audit trails. It is ISO 27001 and SOC 2 certified, GDPR and HIPAA compliant, with SSO support for enterprise identity integration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t8vcsd9cnj12nlbtfai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2t8vcsd9cnj12nlbtfai.png" alt="How to choose the right AIOps platform" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How enterprises actually adopt an AIOps platform&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most teams already have LLMs running in production before they formalize how to operate them. The starting point should be getting clear on what you cannot currently see.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start by identifying one workflow and routing it through Portkey to get immediate visibility into requests, costs, and failures.&lt;/li&gt;
&lt;li&gt;Add routing and fallback rules next. Understanding your traffic patterns first makes it easier to write policies that reflect how your system actually behaves.&lt;/li&gt;
&lt;li&gt;Introduce guardrails before prompt versioning. Safety controls are harder to retrofit once workflows are in production.&lt;/li&gt;
&lt;li&gt;Roll out governance, access control, and cost attribution last. These require enough operational history to set limits that are meaningful rather than arbitrary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What changes when LLMs become operational systems&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As &lt;a href="https://portkey.ai/blog/challenges-faced-by-agentic-ai-companies/" rel="noopener noreferrer"&gt;&lt;u&gt;agentic AI&lt;/u&gt;&lt;/a&gt; systems become more autonomous and multi-step, the need for centralized control only increases.&lt;/p&gt;

&lt;p&gt;If you are already running LLMs in production, the next step is centralizing how those systems are routed, monitored, and controlled. Explore Portkey’s &lt;a href="https://portkey.ai/docs?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;documentation&lt;/u&gt;&lt;/a&gt; to see how this works in practice, or &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; with us.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;FAQs&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: What makes LLM cost management harder than traditional systems?
&lt;/h3&gt;

&lt;p&gt;Costs scale with tokens and multi-step workflows, making them unpredictable without request-level tracking and budget enforcement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do you need to change your application code to adopt AIOps?
&lt;/h3&gt;

&lt;p&gt;No. Most platforms start by routing existing traffic through a gateway to capture logs, apply controls, and add observability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How do teams debug multi-step agent failures?
&lt;/h3&gt;

&lt;p&gt;By tracing the full execution path across model calls, tools, and prompts to identify where the output diverged.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: When should an enterprise introduce governance for LLM usage?
&lt;/h3&gt;

&lt;p&gt;As soon as multiple teams or workflows use LLMs, before cost, access, and compliance issues scale.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>What is AIOps?</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Thu, 16 Apr 2026 21:49:05 +0000</pubDate>
      <link>https://dev.to/portkey/what-is-aiops-3n8n</link>
      <guid>https://dev.to/portkey/what-is-aiops-3n8n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5q464y5082bfdoae1ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5q464y5082bfdoae1ie.png" alt="What is AIOps?" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM systems can fail in ways that never show up on a dashboard. Latency is fine, error rates are steady, infrastructure is green — and yet outputs are drifting, costs are climbing, and nobody can point to why.&lt;/p&gt;

&lt;p&gt;This is the gap most teams hit as they move AI from prototype to production. Traditional monitoring tells you whether your systems are &lt;em&gt;running&lt;/em&gt;, not whether they're &lt;em&gt;behaving correctly&lt;/em&gt;. Requests succeed, but you can't see which model decisions drove a cost spike, why output quality shifted after a config change, or whether a given response even met the intent behind it.&lt;/p&gt;

&lt;p&gt;AIOps shifts the focus from system health to how each request is executed. To make that work, you need a control layer that sits between your application and the models; one place where routing, policies, and visibility come together.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Why traditional MLOps falls short&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Most platform teams start with the monitoring they already have: infrastructure metrics, API logs, latency dashboards. These are necessary but not sufficient. Here's where they break down.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Visibility gaps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Infrastructure metrics measure whether the system is running. They don't measure whether it's doing the right thing. A request that returns HTTP 200 with a hallucinated response looks identical to a correct one in your logs.&lt;/p&gt;

&lt;p&gt;The signals that actually matter for LLM behavior — prompt quality, output relevance, model decision paths — are either absent from standard observability tooling or fragmented across provider dashboards with no shared context.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;No request-level traceability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In a multi-step LLM workflow, a single user request can trigger multiple model calls, tool invocations, and validation steps across services. Traditional observability collects signals at each layer independently, but there's no unified trace connecting them.&lt;/p&gt;

&lt;p&gt;When something goes wrong, you have isolated error events with no way to reconstruct the full execution path. Debugging becomes manual correlation across systems, which is hard to reproduce.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;LLM failures are different&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Latency thresholds and error rate alerts are designed to catch infrastructure problems. LLM failures are different: incorrect outputs, degraded relevance, and cost drift don't move these metrics. Fallback chains can activate repeatedly due to poor response quality and the system will still appear healthy by every standard alert condition.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Costs are unpredictable&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In most setups, token usage and cost are tracked retroactively through provider billing dashboards. There's no mechanism to enforce limits at the request or workflow level during execution. A single misconfigured workflow can silently consume a disproportionate share of your budget before anyone notices.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;How AIOps solves these problems&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Each of the failures above has the same root cause: there is no single operational layer that governs how requests are executed. AIOps introduces that layer. Here is what it enables.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;End-to-end request traceability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;With AIOps, every request carries a unified trace through its full execution path. You can see which prompt version ran, which model handled the request, how routing decisions were applied, what the output was, and where any failure occurred. Every step is connected under a single execution context.&lt;/p&gt;

&lt;p&gt;This means debugging an LLM failure stops being manual log correlation. You follow the trace the same way you would in any other distributed system. OpenTelemetry provides the standardization layer that makes this possible, structuring logs, metrics, and traces consistently so signals across infrastructure and LLM-specific layers can be correlated rather than analyzed in isolation.&lt;/p&gt;
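
&lt;p&gt;A minimal sketch of that span structure using the OpenTelemetry Python API, with exporter setup omitted and the model names and attributes purely illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from opentelemetry import trace

# Provider/exporter setup omitted; with no SDK configured this is a no-op.
# The span structure is the point: one parent span per user request,
# one child span per model call, all sharing a single trace.
tracer = trace.get_tracer("agent-workflow")

def handle_request(user_input):
    with tracer.start_as_current_span("agent.request") as root:
        root.set_attribute("app.user_input_chars", len(user_input))

        with tracer.start_as_current_span("llm.plan") as span:
            span.set_attribute("llm.model", "gpt-4o")           # illustrative
            span.set_attribute("llm.prompt_version", "planner-v3")

        with tracer.start_as_current_span("llm.generate") as span:
            span.set_attribute("llm.model", "claude-sonnet")     # illustrative
            span.set_attribute("llm.tokens.total", 1842)         # from provider usage

handle_request("Summarize the open incidents for this account.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;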

&lt;h2&gt;
  
  
  &lt;strong&gt;Routing and policy enforcement at runtime&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AIOps gives you the ability to define and enforce routing behavior explicitly, rather than letting it happen implicitly across individual services. You set conditions: which model handles which request, when fallbacks activate, how retries behave, what content policies apply before responses reach users.&lt;/p&gt;

&lt;p&gt;When something changes, such as a model degrading, a provider going down, or a cost threshold being hit, the system responds according to defined rules. Routing behavior becomes something you govern, not something you observe after the fact.&lt;/p&gt;
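
&lt;p&gt;A small sketch of what routing policy as data can look like; in a gateway these rules live in configuration, and the model names and thresholds here are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Routing policy as data rather than if/else scattered across services.
# Model names and thresholds are illustrative.
ROUTES = [
    {"when": lambda req: req["type"] == "classification", "model": "small-fast-model"},
    {"when": lambda req: req["max_latency_ms"] &lt; 1500, "model": "mid-tier-model"},
    {"when": lambda req: True, "model": "frontier-model"},   # default
]

def pick_model(request):
    for rule in ROUTES:
        if rule["when"](request):
            return rule["model"]

print(pick_model({"type": "summarization", "max_latency_ms": 5000}))  # frontier-model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;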

&lt;h2&gt;
  
  
  &lt;strong&gt;Usage control during execution, not after&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AIOps shifts cost from a trailing metric to a governable variable. Instead of reviewing spend after the billing cycle closes, you can enforce constraints during execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token limits per workflow, team, or API key&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://portkey.ai/docs/product/ai-gateway/virtual-keys/rate-limits?ref=portkey.ai" rel="noopener noreferrer"&gt;Rate limits&lt;/a&gt; on specific models or endpoints&lt;/li&gt;
&lt;li&gt;Requests that exceed defined thresholds are throttled or blocked&lt;/li&gt;
&lt;li&gt;Usage attributed to specific workflows or owners in real time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means teams can enforce budgets, prevent runaway usage, and align model consumption with operational constraints without waiting for a billing surprise.&lt;/p&gt;
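
&lt;p&gt;A minimal sketch of a per-team token budget enforced at request time; in practice this check lives in the gateway rather than in each application, and the numbers are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Enforce a per-team token budget before a request is forwarded.
# Figures are illustrative; a gateway applies this as shared policy.
BUDGETS = {"search-team": 2_000_000, "support-team": 500_000}   # tokens per month
USED = {"search-team": 1_950_000, "support-team": 120_000}

class BudgetExceeded(Exception):
    pass

def check_budget(team, estimated_tokens):
    projected = USED[team] + estimated_tokens
    if projected &gt; BUDGETS[team]:
        raise BudgetExceeded(f"{team} would exceed its budget ({projected} tokens)")
    USED[team] = projected        # attribute usage at request time, not at billing

check_budget("support-team", 4_000)     # fine
check_budget("search-team", 80_000)     # raises BudgetExceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;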

&lt;h2&gt;
  
  
  &lt;strong&gt;Governance and auditability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As LLM systems scale across teams, AIOps gives you the ability to define and enforce who can call which models, under what conditions, and with what constraints. Access control operates at the model, provider, and API key level.&lt;/p&gt;

&lt;p&gt;Every request is logged with its associated prompt, model, routing decision, and output. This creates a traceable record that supports debugging, compliance validation, and root cause analysis at the request level. Governance stops being documentation and becomes something enforceable.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;What this looks like in practice&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;The architecture that delivers AIOps for LLM systems is an operational layer that intercepts every request between your application and your model providers. At each step, AIOps evaluates routing policy, enforces access controls and usage limits, validates responses, and logs the full execution context.&lt;/p&gt;

&lt;p&gt;Platforms like Portkey's &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; implement this pattern, providing centralized routing, policy enforcement, and observability across all model interactions from a single interface.&lt;/p&gt;

&lt;p&gt;Teams that adopt this approach typically see the same outcomes: faster debugging because traces are complete, more predictable costs because limits are enforced at runtime, and more consistent model behavior because routing is governed by policy rather than scattered across services.&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;FAQs&lt;/strong&gt;
&lt;/h1&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What problems does AIOps solve for LLM systems?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AIOps addresses the visibility and control gaps that standard infrastructure monitoring cannot cover: output drift, cost spikes without obvious triggers, failed requests with no traceable root cause, and policy enforcement across teams and workflows. It shifts the focus from whether requests are completing to how they are executing.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How is AIOps different from monitoring?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Monitoring tells you whether your system is running. AIOps tells you how it is behaving and gives you controls to change that behavior. The distinction matters most when a system appears healthy by every infrastructure metric but is producing incorrect, expensive, or inconsistent outputs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Where should teams start?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pick one workflow with limited visibility and focus on tracing requests end-to-end across prompts, model selection, token usage, and execution path. Once you have that baseline, introduce controls such as routing rules and usage limits before expanding to other workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What skills are needed to implement AIOps for LLM systems?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Teams need familiarity with prompts and model behavior, basic observability concepts (logs, metrics, traces), and enough system design knowledge to introduce an operational layer between applications and models. Deep ML expertise is not required. This is fundamentally a platform engineering problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How do teams measure ROI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The clearest signals are reduced time to root cause when something breaks, fewer unexpected cost increases, and fewer incidents caused by uncontrolled model behavior. Teams also report spending less time on manual log correlation and more time building, which is harder to quantify but consistently cited.&lt;/p&gt;

</description>
      <category>aigovernance</category>
    </item>
    <item>
      <title>The Harness Tax: The Dead Weight Inside Your Coding Agent</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Tue, 14 Apr 2026 06:53:26 +0000</pubDate>
      <link>https://dev.to/portkey/the-harness-tax-the-dead-weight-inside-your-coding-agent-fb5</link>
      <guid>https://dev.to/portkey/the-harness-tax-the-dead-weight-inside-your-coding-agent-fb5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk4nbuaa7d3ch0d3fk5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjk4nbuaa7d3ch0d3fk5c.png" alt="The Harness Tax: The Dead Weight Inside Your Coding Agent" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://x.com/hwchase17/status/2042978500567609738?ref=portkey.ai" rel="noopener noreferrer"&gt;Harnesses are not going away&lt;/a&gt;. Even the best models rely on them. Claude Code alone has ~512k lines of harness code. But nobody talks about what that harness actually costs you at inference time.&lt;/p&gt;

&lt;p&gt;I wanted to know: when using coding agents, how much of the payload that hits the model is actually my message? And how much is overhead added by the harness?&lt;/p&gt;

&lt;p&gt;So I pointed three agents at &lt;a href="https://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey's gateway&lt;/a&gt; and captured every request: &lt;a href="https://github.com/badlogic/pi-mono?ref=portkey.ai" rel="noopener noreferrer"&gt;Pi&lt;/a&gt; (the harness behind &lt;a href="https://openclaw.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;), &lt;a href="https://openai.com/index/codex/?ref=portkey.ai" rel="noopener noreferrer"&gt;OpenAI Codex&lt;/a&gt;, and &lt;a href="https://www.anthropic.com/product/claude-code?ref=portkey.ai" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;. Same requests, complete token visibility. Then I gave each one the same two messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Message 1: hey
Message 2: write a simple python script to check
fibonacci series and save on desktop as agent.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pi sent ~2,600 input tokens. Claude Code sent ~27,000. A 10x spread. Same task. Same model capability. The difference was pure harness overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Harness Tax
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;The Harness Tax&lt;/strong&gt; is every token your agent spends on itself before it spends a single token on your task.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nayy2o7cuyzt4jg72cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nayy2o7cuyzt4jg72cc.png" alt="The Harness Tax: The Dead Weight Inside Your Coding Agent" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You pay this tax before the model does a single unit of useful work. Every agent has one. You never see it unless you look at raw request logs. I routed all three agents through a gateway to get that visibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Goes Into the Harness Tax?
&lt;/h3&gt;

&lt;p&gt;Every request a coding agent makes to the model carries the full harness payload: tool definitions, system prompt, memory instructions, behavioral routing, and conversation history. All of it. On every turn.&lt;/p&gt;

&lt;p&gt;Claude Code's harness costs roughly 27,000 input tokens per request. Codex costs about 15,000. Pi costs about 2,600.&lt;/p&gt;

&lt;p&gt;And because the conversation history includes the model's previous responses, which were themselves inflated by verbose tool-call formatting, the payload grows faster than your actual conversation does.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z69g3374tin4szcacn9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0z69g3374tin4szcacn9.png" alt="The Harness Tax: The Dead Weight Inside Your Coding Agent" width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A real coding session runs 30 to 50 turns. At Claude Code's rate, a 40-turn session burns through 1.12 million input tokens. Roughly half of those are harness overhead.&lt;/p&gt;
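
&lt;p&gt;The arithmetic is simple enough to sketch, using the per-request figures above and ignoring the conversation history that grows on top of them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-the-envelope version of the session figure quoted above:
# a fixed per-request payload repeated on every turn, ignoring the
# conversation history that rides on top of it.
PER_REQUEST_TOKENS = {"claude_code": 27_000, "codex": 15_000, "pi": 2_600}
TURNS = 40

for agent, per_request in PER_REQUEST_TOKENS.items():
    print(f"{agent}: ~{per_request * TURNS:,} input tokens over {TURNS} turns")
# claude_code: ~1,080,000 -- roughly the 1.12M session figure above
# pi:          ~104,000   -- the same session on a thin harness
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;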

&lt;blockquote&gt;
&lt;p&gt;💡 You pay the harness tax whether you use the tools or not. The 24 extra tools in Claude Code were defined but never called. Their definitions shipped on every request anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Context Rot
&lt;/h2&gt;

&lt;p&gt;The harness tax isn’t just a cost problem. It’s an attention problem. Every extra token competes with your actual task: your code, your files, your intent.&lt;/p&gt;

&lt;p&gt;As the context window fills, the model gets worse at reasoning over the tokens that matter. On a complex refactor where the model needs to hold three source files, a test suite, and twenty turns of conversation, 28,000 tokens of framework plumbing aren't sitting idle. They're noise.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 A 200k context window carrying 28k tokens of harness overhead isn't a 200k window. It's a 172k window with worse attention distribution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The harness rots in a second way: staleness. Every component encodes an assumption about what the model can't do on its own. Those assumptions go stale fast. More on that below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thin Harness, Fat Skills
&lt;/h2&gt;

&lt;p&gt;Pi gives the model four capabilities: read a file, write a file, edit a file, and run a shell command. That's the entire tool surface.&lt;/p&gt;

&lt;p&gt;The bet is that a model trained on millions of shell sessions, the internet, and GitHub repos already knows how to compose those primitives into anything else. You don't need a dedicated list_directory tool when ls -la exists. You don't need search_files when the model can write grep -r on its own.&lt;/p&gt;
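
&lt;p&gt;One common way to express a tool surface like that is the JSON-schema tool format most chat-completion APIs accept. This is a sketch of Pi-style primitives, not Pi's actual definitions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# A Pi-style minimal tool surface, expressed in the JSON-schema tool format
# most chat-completion APIs accept. Illustrative, not Pi's actual definitions.
TOOLS = [
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file and return its contents.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "write_file",
        "description": "Write contents to a file, creating it if needed.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "contents": {"type": "string"}},
                       "required": ["path", "contents"]}}},
    {"type": "function", "function": {
        "name": "edit_file",
        "description": "Replace one string in a file with another.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "old": {"type": "string"},
                                      "new": {"type": "string"}},
                       "required": ["path", "old", "new"]}}},
    {"type": "function", "function": {
        "name": "run_shell",
        "description": "Run a shell command and return stdout and stderr.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
]
# Everything else (listing directories, searching) composes from these:
# the model can call run_shell with 'ls -la' or 'grep -r' on its own.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;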

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"All frontier models have been RL-trained up the wazoo. They inherently understand what a coding agent is."— Mario Zechner, Pi's creator&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/engineering/harness-design-long-running-apps?ref=portkey.ai" rel="noopener noreferrer"&gt;Anthropic's harness engineering team&lt;/a&gt; demonstrated this concretely over three model generations. Their coding agent harness for Sonnet 4.5 required context resets because the model would start wrapping up work prematurely as the window filled. Opus 4.5 shipped, resets became unnecessary. Opus 4.6 shipped; they stripped out sprint decomposition entirely, and it still worked better.&lt;/p&gt;

&lt;p&gt;Three model generations. Three layers of harness removed. Load-bearing in January, dead weight by March.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Harnesses encode assumptions that go stale as models improve -&lt;/em&gt; &lt;a href="https://www.anthropic.com/engineering/harness-design-long-running-apps?ref=portkey.ai" rel="noopener noreferrer"&gt;Anthropic Blog&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An agent has three layers. Complexity should push up into the model, which gets better at reasoning, planning, and self-correction with every release. It should push down into infrastructure, where routing, governance, observability, and cost controls don't ride along in the context window. The harness in the middle should carry as little as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8bjgjj8aqst06n6u644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8bjgjj8aqst06n6u644.png" alt="The Harness Tax: The Dead Weight Inside Your Coding Agent" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means
&lt;/h2&gt;

&lt;p&gt;This was a narrow benchmark. Two messages, one trivial task. Claude Code's deep tooling may earn back its overhead on complex work that genuinely exercises those 28 tools.&lt;/p&gt;

&lt;p&gt;What this benchmark does show: the overhead exists, it's measurable, and almost nobody is looking at it. For most tasks, the model is carrying 15,000 tokens of framework plumbing it doesn't need. And that overhead is growing slower than models are improving, which means the tax gets harder to justify.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Route your agent through&lt;/em&gt; &lt;a href="https://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;em&gt;Portkey&lt;/em&gt;&lt;/a&gt; &lt;em&gt;to measure your own harness tax.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Further Reading:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.claudeusercontent.com/?domain=claude.ai&amp;amp;parentOrigin=https%3A%2F%2Fclaude.ai&amp;amp;errorReportingMode=parent&amp;amp;formattedSpreadsheets=true&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Mario Zechner's blog post on building Pi — the design rationale&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.claudeusercontent.com/?domain=claude.ai&amp;amp;parentOrigin=https%3A%2F%2Fclaude.ai&amp;amp;errorReportingMode=parent&amp;amp;formattedSpreadsheets=true&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Armin Ronacher: "Pi: The Minimal Agent Within OpenClaw"&lt;/a&gt;&lt;a href="https://www.claudeusercontent.com/?domain=claude.ai&amp;amp;parentOrigin=https%3A%2F%2Fclaude.ai&amp;amp;errorReportingMode=parent&amp;amp;formattedSpreadsheets=true&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Pi on GitHub · OpenClaw on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article was first published on &lt;a href="https://x.com/siddhxrth10/status/2043703343453987133?ref=portkey.ai" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>agentsinproduction</category>
    </item>
    <item>
      <title>Cursor best practices for enterprise teams</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Thu, 09 Apr 2026 09:23:12 +0000</pubDate>
      <link>https://dev.to/portkey/cursor-best-practices-for-enterprise-teams-30hn</link>
      <guid>https://dev.to/portkey/cursor-best-practices-for-enterprise-teams-30hn</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mnjf1u9tqbf2vxzmrll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mnjf1u9tqbf2vxzmrll.png" alt="Cursor best practices for enterprise teams" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cursor has become one of the most widely adopted AI coding tools in enterprise engineering organizations. The individual productivity gains are real. &lt;/p&gt;

&lt;p&gt;But operational control across teams is a different problem entirely. Platform teams now face a new challenge: scattered API keys, invisible token spend, no budget controls, and compliance gaps.&lt;/p&gt;

&lt;p&gt;This best practices playbook is for platform teams running Cursor across dozens or hundreds of engineers, and it shows how to implement each practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Cursor is and why teams are adopting it at scale
&lt;/h2&gt;

&lt;p&gt;Cursor is an AI-first code editor built on a VS Code fork, designed for AI-native development workflows. It integrates large language models directly into the development environment for autocomplete, multi-file edits, codebase-aware chat, and agent-driven tasks that can plan and implement changes across a repository.&lt;/p&gt;

&lt;p&gt;Teams typically access Cursor through one of several plans, depending on how they want to manage usage and billing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Plan&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it includes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Free&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Exploration and evaluation&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Limited premium requests and basic AI features&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Pro&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Individual developers&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Higher request limits and access to more capable models&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Teams&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Engineering teams&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Centralized billing, usage analytics, SSO, and admin controls&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Enterprise&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Large organizations&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Custom pricing, expanded quotas, SCIM provisioning, and enhanced compliance controls&lt;/p&gt;

&lt;p&gt;|&lt;/p&gt;

&lt;p&gt;Many teams also use Cursor in &lt;strong&gt;Bring Your Own Key (BYOK)&lt;/strong&gt; mode, connecting directly to providers like OpenAI, Anthropic, Bedrock, or Vertex AI and paying per API usage. &lt;/p&gt;

&lt;h2&gt;
  
  
  Where Cursor's native tooling falls short at scale
&lt;/h2&gt;

&lt;p&gt;At enterprise scale, five core operational gaps emerge. These are infrastructure and governance gaps, not issues with how developers use Cursor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vendor lock-in
&lt;/h3&gt;

&lt;p&gt;Cursor connects to whatever provider credentials a developer supplies. Switching providers, models, or accounts requires updating configurations on individual machines. There is no centralized way to manage provider selection, failover, or traffic distribution at the team level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Invisible spend due to silos
&lt;/h3&gt;

&lt;p&gt;With subscription plans, usage is opaque at the org level. With BYOK, every developer's spend is siloed to their own key. There is &lt;strong&gt;no centralized, cross-provider view of usage by team, project, or environment&lt;/strong&gt;, and no real-time visibility into which workflows or teams are driving cost.&lt;/p&gt;

&lt;p&gt;Teams also run into &lt;strong&gt;underutilization&lt;/strong&gt; problems, where prepaid credits, committed spend, or reserved capacity with one provider go unused because usage is spread across individual developer keys instead of being centrally routed and optimized.&lt;/p&gt;

&lt;h3&gt;
  
  
  Credential sprawl and data exposure risks
&lt;/h3&gt;

&lt;p&gt;When developers use their own API keys, those keys end up stored in environment files, shared in Slack, or committed to repositories. As adoption grows, this turns into a credential management problem rather than a developer workflow problem. There is no built-in credential hierarchy for provider API keys, no scoped keys per team or project, and no audit trail showing which key made which request.&lt;/p&gt;

&lt;p&gt;Additionally, teams often have no visibility into what data is being sent to which model provider. Source code, internal documentation, or sensitive data can be included in prompts without any centralized filtering, logging, or policy enforcement, which creates data security and compliance risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runaway costs
&lt;/h3&gt;

&lt;p&gt;There is no native mechanism to set budget caps per team or project before a session begins. A single long-running agent session on a frontier model can consume a significant amount of tokens without any warning or enforcement.&lt;/p&gt;

&lt;h3&gt;
  
  
  No restriction over model usage
&lt;/h3&gt;

&lt;p&gt;Role-based access exists at the IDE and workspace level, but not at the model and provider access layer. Teams cannot easily restrict access to specific models, route routine tasks to cheaper models, or centrally enforce model usage policies.&lt;/p&gt;

&lt;p&gt;All of these issues appear for the same reason: Cursor is being used as a shared system, but there is no control layer between Cursor and the model providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor best practices: Add a gateway between Cursor and your LLM providers
&lt;/h2&gt;

&lt;p&gt;The architectural solution to the problems above is to introduce an &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;AI gateway&lt;/u&gt;&lt;/a&gt; between Cursor and your LLM providers. The gateway becomes the control plane that applies policy, logs usage, routes requests, and enforces limits before any request reaches a model.&lt;/p&gt;

&lt;p&gt;Here are six implementation-ready Cursor best practices:&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice 1: Provider flexibility and failover
&lt;/h3&gt;

&lt;p&gt;Instead of sending requests directly from Cursor to a single provider, requests go through the gateway. This gateway can route traffic to multiple providers such as Anthropic, Google Gemini, Google Vertex AI, Azure OpenAI, AWS Bedrock, and others. Teams can configure routing logic, load balancing, and automatic fallbacks so that if a primary provider is unavailable or rate-limited, requests are routed to a backup provider automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice 2: Centralized credential management
&lt;/h3&gt;

&lt;p&gt;Provider API keys are stored centrally in the gateway instead of being distributed to developers. Developers use &lt;strong&gt;scoped API keys per user, team, or project&lt;/strong&gt; that inherit provider access. Access can be revoked or rotated from a single place without touching developer machines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice 3: Budget limits and rate limits before coding starts
&lt;/h3&gt;

&lt;p&gt;Budget limits can be set per API key, user, team or project before access is distributed. &lt;a href="https://portkey.ai/blog/rate-limiting-for-llm-applications/" rel="noopener noreferrer"&gt;&lt;u&gt;Rate limits&lt;/u&gt;&lt;/a&gt; can prevent any single developer or team from overwhelming shared provider capacity. This is especially important when multiple teams share the same Bedrock or Vertex AI quotas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice 4: Full request logging with cost attribution
&lt;/h3&gt;

&lt;p&gt;Every request can be logged with metadata such as team, project, developer, model, token usage, cost, and latency. This allows teams to attribute AI spend accurately and investigate usage spikes or anomalies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice 5: Guardrails on inputs and outputs
&lt;/h3&gt;

&lt;p&gt;With the &lt;a href="https://portkey.ai/blog/what-is-an-llm-gateway/" rel="noopener noreferrer"&gt;LLM &lt;u&gt;gateway&lt;/u&gt;&lt;/a&gt;, you can apply PII guardrails, content filtering, and custom security rules to prompts and responses before they reach the provider or return to the developer. This reduces the risk of sensitive data exposure and helps enforce organizational policies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Practice 6: Model access governance
&lt;/h3&gt;

&lt;p&gt;Teams can define which models specific teams or developers are allowed to use, and routing logic can be updated centrally without requiring changes to individual developer setups.&lt;/p&gt;

&lt;p&gt;The key idea is that developers continue using Cursor exactly as they do today. However, model access, routing, budgets, and guardrails are enforced at the infrastructure layer instead of relying on individual developer configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3aoqazubbkgeu3xdcvuh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3aoqazubbkgeu3xdcvuh.png" alt="Cursor best practices for enterprise teams" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Portkey makes this operational model possible
&lt;/h2&gt;

&lt;p&gt;The best practices described above require an operational layer between Cursor and your LLM providers. This is where &lt;a href="https://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey&lt;/a&gt; acts as an AI gateway that sits between Cursor and model providers. The platform helps turn model access into a controlled, observable infrastructure layer instead of a direct connection from each developer’s machine.&lt;/p&gt;

&lt;p&gt;Once Cursor is connected to Portkey, policies for routing, budgets, guardrails, logging, and access control can be enforced centrally.&lt;/p&gt;

&lt;p&gt;From the developer’s perspective, nothing changes. From the platform team’s perspective, model access becomes centralized, observable, and governed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connecting Cursor to Portkey
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://portkey.ai/docs/integrations/libraries/cursor?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Cursor integration&lt;/u&gt;&lt;/a&gt; follows a simple pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add your model providers in Portkey and configure routing, budgets, and guardrails&lt;/li&gt;
&lt;li&gt;Generate a scoped Portkey API key&lt;/li&gt;
&lt;li&gt;Define routing logic, fallback providers, and limits&lt;/li&gt;
&lt;li&gt;In Cursor Settings, enable the OpenAI API Key option&lt;/li&gt;
&lt;li&gt;Enter the Portkey API key and override the base URL to &lt;a href="https://api.portkey.ai/v1" rel="noopener noreferrer"&gt;https://api.portkey.ai/v1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Verify the connection and start using Cursor normally&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All Cursor traffic now routes through Portkey, where policies, logging, routing, and guardrails are applied automatically.&lt;/p&gt;
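
&lt;p&gt;A quick way to sanity-check the route outside the IDE is to point any OpenAI-compatible client at the same key and base URL; the model name below is illustrative and depends on which providers the key is allowed to use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from openai import OpenAI

# Same credentials Cursor uses: a scoped Portkey API key and the gateway
# base URL. If this call shows up in the Portkey logs, the route works.
client = OpenAI(
    api_key="PORTKEY_API_KEY",              # scoped key from step 2
    base_url="https://api.portkey.ai/v1",   # same override as in Cursor settings
)

response = client.chat.completions.create(
    model="gpt-4o",                          # illustrative; any model the key allows
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If this request appears in the Portkey logs with its cost and latency, the same will be true for every request Cursor sends.&lt;/p&gt;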

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tj6ik98y99fyi09pmwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tj6ik98y99fyi09pmwm.png" alt="Cursor best practices for enterprise teams" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling Cursor Across Engineering Teams: Governance Patterns That Work
&lt;/h2&gt;

&lt;p&gt;Once Cursor traffic flows through a gateway, platform teams can implement governance patterns that make usage predictable and manageable across the organization.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Capability area&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor Teams/Enterprise&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI gateway (Portkey)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Identity and access&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;SSO, admin controls, team management&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Scoped API keys per team/project&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;IDE governance&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Privacy mode, Team Rules, IDE policies&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Not applicable&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;LLM routing&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Not available&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Fallbacks, load balancing, conditional routing&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Budget controls&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Per-user request accounting (Teams plan)&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Per-team and per-project budget limits, rate limits&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Observability&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Basic usage metrics&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Detailed logs with prompts, tokens, cost, latency, metadata tags&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Guardrails&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Not available&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;PII detection, content filtering, prompt injection blocking&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Provider management&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;BYOK or Cursor credits&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Centralized provider keys, multi-provider access&lt;/p&gt;

&lt;p&gt;|&lt;br&gt;
| &lt;/p&gt;

&lt;p&gt;Audit trails&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Limited&lt;/p&gt;

&lt;p&gt;| &lt;/p&gt;

&lt;p&gt;Complete request/response audit logs&lt;/p&gt;

&lt;p&gt;|&lt;/p&gt;

&lt;h2&gt;
  
  
  Building AI coding infrastructure that holds up in production
&lt;/h2&gt;

&lt;p&gt;If you are running Cursor across multiple teams, start by routing one team through Portkey, setting budget limits, and enabling request logging and fallback routing.&lt;/p&gt;

&lt;p&gt;Once the control layer is in place, you can expand usage across the organization without losing visibility, control, or cost predictability.&lt;/p&gt;

&lt;p&gt;Explore the &lt;a href="https://portkey.ai/docs/integrations/libraries/cursor?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Cursor integration docs&lt;/u&gt;&lt;/a&gt; to see the setup process or &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;book a personalized demo&lt;/u&gt;&lt;/a&gt; for an enterprise deployment walkthrough.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Q: Can I use Portkey with Cursor without switching away from my existing LLM provider?
&lt;/h3&gt;

&lt;p&gt;Yes. Portkey works alongside your existing providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: What happens when a team hits their budget limit set in Portkey?
&lt;/h3&gt;

&lt;p&gt;Alerts are sent as teams approach their budget thresholds so action can be taken before limits are reached. Budgets can be adjusted or reset at any time without changing developer configurations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: How is this different from Cursor's built-in Teams or Enterprise plan?
&lt;/h3&gt;

&lt;p&gt;Cursor handles IDE governance such as SSO and workspace policies. Portkey handles the LLM infrastructure layer, including routing, budgets, observability, and guardrails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q: Do developers need to change anything about how they use Cursor after the gateway is set up?
&lt;/h3&gt;

&lt;p&gt;No. Only the API key and base URL change in Cursor settings. The developer workflow remains the same.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>softwareengineering</category>
      <category>tooling</category>
    </item>
    <item>
      <title>What is AI lifecycle management?</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Tue, 24 Mar 2026 08:23:14 +0000</pubDate>
      <link>https://dev.to/portkey/what-is-ai-lifecycle-management-18a2</link>
      <guid>https://dev.to/portkey/what-is-ai-lifecycle-management-18a2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwfllf6xifr32zhgwif0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwfllf6xifr32zhgwif0.png" alt="What is AI lifecycle management?" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI lifecycle management begins before any model is called. The first step is defining &lt;strong&gt;where AI should be used, what success looks like, and how risk is managed from the start&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For most teams, this means moving beyond isolated experiments to identifying &lt;strong&gt;repeatable, high-impact use cases,&lt;/strong&gt; whether internal copilots, workflow automation, or customer-facing AI features. Each use case should have clearly defined success metrics across &lt;strong&gt;quality, latency, cost, and adoption&lt;/strong&gt;, so performance can be measured as systems scale.&lt;/p&gt;

&lt;p&gt;Equally important is &lt;strong&gt;risk classification&lt;/strong&gt;. Not all AI applications carry the same level of sensitivity. Internal productivity tools, for example, require different controls compared to systems handling user data or making critical decisions. Establishing this early helps determine the right level of governance, guardrails, and monitoring required downstream.&lt;/p&gt;

&lt;p&gt;AI lifecycle management is the process of building, deploying, and continuously improving AI systems in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Data governance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AI systems are only as reliable as the data they operate on. Managing the data lifecycle is critical to ensuring outputs remain accurate, safe, and compliant over time.&lt;/p&gt;

&lt;p&gt;This starts with &lt;strong&gt;controlled data access,&lt;/strong&gt; defining what data can be used in prompts, which systems can access it, and how sensitive information (like PII) is handled. As AI usage scales across teams, enforcing consistent policies becomes essential to avoid leakage or misuse.&lt;/p&gt;

&lt;p&gt;Equally important is &lt;strong&gt;data quality and consistency&lt;/strong&gt;. Inputs to AI systems should be structured, validated, and, where necessary, filtered before reaching the model. This includes applying guardrails such as content filters, redaction, or transformation layers to standardize requests and responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model selection and evaluation
&lt;/h2&gt;

&lt;p&gt;Choosing the right model is not a one-time decision. It is an ongoing process of balancing &lt;strong&gt;quality, latency, and cost&lt;/strong&gt; across different use cases.&lt;/p&gt;

&lt;p&gt;Most teams start by testing a few models, but as AI usage grows, this quickly becomes harder to manage. Different tasks may require different models, some optimized for reasoning, others for speed or cost efficiency.&lt;/p&gt;

&lt;p&gt;Evaluation plays a critical role here. This includes &lt;strong&gt;offline testing&lt;/strong&gt; on curated datasets to benchmark performance, as well as &lt;strong&gt;online evaluation&lt;/strong&gt; on live traffic to understand real-world behavior. Tracking outputs, comparing responses, and identifying regressions helps teams make informed decisions when switching or upgrading models.&lt;/p&gt;

&lt;p&gt;Over time, models evolve, new providers emerge, and pricing changes. Treating model selection as part of the lifecycle allows teams to continuously optimize performance without disrupting production systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prompt and agent management
&lt;/h2&gt;

&lt;p&gt;Prompts and agents define how AI systems behave in production. Managing them as versioned, testable components is essential to maintaining consistency and control.&lt;/p&gt;

&lt;p&gt;Prompts should not live inside application code. Instead, they need to be &lt;strong&gt;versioned, reusable, and independently updatable&lt;/strong&gt;, so teams can iterate without redeploying entire systems. Even small changes in prompts can significantly impact output quality.&lt;/p&gt;
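&lt;p&gt;A toy sketch of that separation, assuming nothing more than a local registry keyed by prompt name and version (a real prompt-management setup adds storage, review, and rollout on top):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy prompt registry: prompts live outside application code and are referenced
# by name and version, so a prompt change never requires an app redeploy.
PROMPTS = {
    ("support_reply", "v1"): "You are a support agent. Answer briefly: {question}",
    ("support_reply", "v2"): "You are a support agent. Answer briefly and cite the policy section: {question}",
}

def render_prompt(name: str, version: str, **variables: str) -&gt; str:
    template = PROMPTS[(name, version)]
    return template.format(**variables)

# The application only pins a name and version; moving from v1 to v2 is a config change.
print(render_prompt("support_reply", "v2", question="How do I reset my password?"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;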

&lt;p&gt;As systems evolve, many teams move from simple prompts to &lt;strong&gt;agents with tool-calling and multi-step reasoning&lt;/strong&gt;. This introduces additional complexity: agents can behave non-deterministically, depend on external tools, and vary based on context. Managing this requires clear definitions of agent logic, tool access, and fallback behavior.&lt;/p&gt;

&lt;p&gt;Guardrails also play an important role at this layer. Applying constraints on inputs and outputs helps ensure that prompts and agents operate within defined boundaries, especially in sensitive or user-facing scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and governance
&lt;/h2&gt;

&lt;p&gt;As AI systems move into production, they need the same level of control and accountability as any other critical infrastructure.&lt;/p&gt;

&lt;p&gt;This begins with &lt;strong&gt;access control&lt;/strong&gt;: defining who can use which models, tools, and data. Role-based access control (RBAC), workspace isolation, and scoped permissions ensure that usage is intentional and contained, especially in larger organizations.&lt;/p&gt;

&lt;p&gt;Equally important is &lt;strong&gt;cost and usage governance&lt;/strong&gt;. AI workloads can scale quickly, making it essential to enforce budgets, rate limits, and usage policies at both the user and application level. This prevents unexpected spend while maintaining system stability.&lt;/p&gt;

&lt;p&gt;Security also extends to &lt;strong&gt;policy enforcement and compliance&lt;/strong&gt;. Guardrails help filter unsafe or non-compliant content, while audit logs provide a clear record of how AI systems are being used. For enterprise environments, alignment with standards like GDPR, SOC 2, and ISO requirements is often mandatory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment and routing
&lt;/h2&gt;

&lt;p&gt;A critical part of AI lifecycle management is keeping production systems reliable under scale and constant change, without adding complexity to the application itself.&lt;/p&gt;

&lt;p&gt;A unified gateway layer sits between applications and model providers, abstracting away provider-specific logic and enabling teams to manage AI traffic centrally. This allows applications to remain stable even as models, providers, or configurations evolve.&lt;/p&gt;

&lt;p&gt;At the core of this is &lt;strong&gt;intelligent routing&lt;/strong&gt;. Requests can be dynamically routed across models and providers based on latency, cost, quality, or availability. If a provider fails or degrades, &lt;strong&gt;automatic failover&lt;/strong&gt; ensures continuity without requiring changes in application code.&lt;/p&gt;

&lt;p&gt;Production reliability also depends on built-in controls like &lt;strong&gt;retries, timeouts, rate limits, caching, and load balancing&lt;/strong&gt;. These mechanisms help reduce failures, smooth traffic spikes, and maintain consistent performance under scale.&lt;/p&gt;
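&lt;p&gt;The core mechanic behind failover and retries fits in a few lines. The sketch below is a generic illustration, not a specific gateway implementation; the provider names and the &lt;code&gt;call_provider&lt;/code&gt; stub are placeholders, and a real gateway layers timeouts, rate limits, caching, and load balancing around the same loop.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time

def call_provider(provider: str, prompt: str) -&gt; str:
    # Placeholder: replace with the actual client call for each provider.
    raise RuntimeError(f"{provider} unavailable")

def complete_with_failover(prompt: str, providers: list[str], attempts: int = 2) -&gt; str:
    """Try each provider in order; retry transient failures with a short backoff."""
    last_error = None
    for provider in providers:
        for attempt in range(attempts):
            try:
                return call_provider(provider, prompt)
            except RuntimeError as exc:
                last_error = exc
                time.sleep(0.5 * (attempt + 1))   # simple linear backoff
    raise RuntimeError(f"all providers failed: {last_error}")

# Application code stays the same even if the primary provider degrades:
# complete_with_failover("Draft a status update", ["primary-provider", "backup-provider"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;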

&lt;p&gt;For enterprise deployments, this layer also supports &lt;strong&gt;regional routing, private infrastructure (VPC), and secure production environments&lt;/strong&gt;, ensuring compliance and data control alongside performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability and monitoring
&lt;/h2&gt;

&lt;p&gt;Effective AI lifecycle management requires deep visibility into system behavior. Without it, teams have no reliable way to understand performance, debug issues, or improve outcomes.&lt;/p&gt;

&lt;p&gt;AI observability goes beyond traditional metrics. In addition to &lt;strong&gt;latency, error rates, and throughput&lt;/strong&gt;, teams need visibility into &lt;strong&gt;model outputs, prompt behavior, token usage, and cost&lt;/strong&gt;. This helps answer questions like: Why did a response fail? Which model is underperforming? Where is spend increasing?&lt;/p&gt;

&lt;p&gt;Centralized logging and tracing make this possible. Every request should carry its context: the &lt;strong&gt;model used, prompt version, routing decision, and response&lt;/strong&gt;, so teams can trace issues end-to-end. For more complex systems, especially agents, &lt;strong&gt;execution traces&lt;/strong&gt; help visualize multi-step workflows and identify where breakdowns occur.&lt;/p&gt;

&lt;p&gt;Monitoring also enables proactive control. Tracking trends in performance and usage allows teams to detect anomalies, enforce budgets, and maintain consistent user experience as traffic scales.&lt;/p&gt;

&lt;p&gt;With a unified observability layer, AI systems become measurable and debuggable, turning what would otherwise be a black box into a system teams can actively manage and improve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continuous improvement
&lt;/h2&gt;

&lt;p&gt;AI lifecycle management doesn’t end at deployment. It requires continuous optimization across every layer of the system.&lt;/p&gt;

&lt;p&gt;With Portkey's &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt;, teams can iterate across the full AI lifecycle, from model selection and prompt updates to routing changes and cost optimization, without disrupting production. Observability, logging, and tracing provide the foundation to evaluate performance, detect regressions, and understand real-world behavior.&lt;/p&gt;

&lt;p&gt;Because all requests flow through a unified gateway, improvements can be applied centrally. Teams can update prompts, switch models, refine routing strategies, and enforce policies in one place, instead of making changes across multiple applications.&lt;/p&gt;


&lt;h2&gt;
  
  
  Ship AI Agents faster with Portkey
&lt;/h2&gt;

&lt;p&gt;Everything you need to build, deploy, and scale AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://portkey.sh/blogs?ref=portkey.ai" rel="noopener noreferrer"&gt;Get Started&lt;/a&gt; | &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;Book a Demo&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>management</category>
      <category>product</category>
    </item>
    <item>
      <title>Claude Code agents: what they are, how they work, and how to scale them</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Mon, 09 Mar 2026 14:57:31 +0000</pubDate>
      <link>https://dev.to/portkey/claude-code-agents-what-they-are-how-they-work-and-how-to-scale-them-15n0</link>
      <guid>https://dev.to/portkey/claude-code-agents-what-they-are-how-they-work-and-how-to-scale-them-15n0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3t9iaujrlzd1ybm3ew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fck3t9iaujrlzd1ybm3ew.png" alt="Claude Code agents: what they are, how they work, and how to scale them"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code is now the most widely used AI coding agent. We all know what it does. The harder question is what happens when you roll it out across a team of 20, 50, or 200 engineers, each running agentic loops that spawn subagents, call MCP tools, and burn through tokens autonomously.&lt;/p&gt;

&lt;p&gt;This post covers the agent architecture inside Claude Code, what breaks when you take it to production, and how to add the governance and observability layer that enterprises need.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Claude Code agents work under the hood
&lt;/h2&gt;

&lt;p&gt;Claude Code runs a straightforward agent loop: the model produces a message, and if it includes a tool call, the tool executes and results feed back into the model. No tool call means the loop stops and the agent waits for input. It ships with around 14 tools spanning file operations, shell commands, web access, and control flow.&lt;/p&gt;
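&lt;p&gt;In simplified form, that loop looks like the sketch below. It is a generic stand-in rather than Claude Code's actual source; &lt;code&gt;model_step&lt;/code&gt; and &lt;code&gt;run_tool&lt;/code&gt; are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Generic tool-calling agent loop (a simplified stand-in, not Claude Code internals).
def model_step(messages: list) -&gt; dict:
    # Placeholder: ask the model for its next message given the conversation so far.
    return {"content": "done", "tool_call": None}

def run_tool(tool_call: dict) -&gt; str:
    # Placeholder: execute a file edit, shell command, web fetch, etc.
    return "tool output"

def agent_loop(user_request: str) -&gt; str:
    messages = [{"role": "user", "content": user_request}]
    while True:
        reply = model_step(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        if reply["tool_call"] is None:
            # No tool call: the loop stops and waits for the next user input.
            return reply["content"]
        result = run_tool(reply["tool_call"])
        messages.append({"role": "tool", "content": result})

print(agent_loop("Rename the helper function and update its callers."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;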

&lt;p&gt;What makes the system powerful is the layered agent architecture sitting on top of that loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subagents
&lt;/h3&gt;

&lt;p&gt;Claude Code delegates to specialized subagents that each run in their own context window. The built-in ones include Explore (read-only codebase search), Plan (research for planning mode), and a general-purpose agent for complex multi-step tasks. Each subagent has restricted tool access and independent permissions.&lt;/p&gt;

&lt;p&gt;The real value here is context management. Agentic tasks fill up the context window fast. Subagents keep exploration and implementation out of the main conversation, preserving the primary context for decision-making.&lt;/p&gt;

&lt;p&gt;You can also define custom subagents as markdown files with YAML frontmatter, scoped to specific tools, models, and system prompts. A read-only code reviewer, a documentation generator, a security auditor, each with exactly the permissions it needs.&lt;/p&gt;
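&lt;p&gt;A definition might look like the sketch below, shown here as a Python snippet that writes the markdown file. The &lt;code&gt;.claude/agents&lt;/code&gt; location, the frontmatter keys, and the tool names are assumptions to verify against the Claude Code documentation for your version; the reviewer persona is purely illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pathlib import Path

# Illustrative subagent definition: a read-only code reviewer.
# The frontmatter keys (name, description, tools) and the .claude/agents path
# are assumptions to check against the Claude Code docs.
subagent = """---
name: code-reviewer
description: Reviews diffs for bugs and style issues. Never edits files.
tools: Read, Grep, Glob
---
You are a careful code reviewer. Point out correctness issues first,
then style. Do not modify any files.
"""

Path(".claude/agents").mkdir(parents=True, exist_ok=True)
Path(".claude/agents/code-reviewer.md").write_text(subagent)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;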

&lt;h3&gt;
  
  
  Agent teams
&lt;/h3&gt;

&lt;p&gt;Subagents work within a single session. Agent teams coordinate across separate sessions, enabling parallel workflows. Think: one agent building a backend API while another builds the frontend, each in an isolated Git worktree.&lt;/p&gt;

&lt;p&gt;Anthropic's own 2026 trends report names multi-agent coordination as a top priority for engineering teams this year.&lt;/p&gt;

&lt;h3&gt;
  
  
  Skills, hooks, and the Agent SDK
&lt;/h3&gt;

&lt;p&gt;Skills are reusable instruction packages that Claude loads automatically when it encounters a matching task. Hooks trigger actions at specific workflow points (run tests after changes, lint before commits). The Claude Agent SDK gives teams the same underlying tools and permissions framework to build custom agent experiences outside the terminal entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What breaks at scale
&lt;/h2&gt;

&lt;p&gt;None of these capabilities ship with the operational scaffolding that production environments demand. Here is where teams consistently run into trouble.&lt;/p&gt;

&lt;h3&gt;
  
  
  Runaway costs with no attribution
&lt;/h3&gt;

&lt;p&gt;A single complex task can spawn multiple subagents, each making dozens of tool calls across extended agentic loops. Multiply that by a team of engineers, and token spend becomes unpredictable. &lt;a href="https://portkey.ai/for/claude-code?ref=portkey.ai" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; has no native cost tracking by user, team, or project. The first monthly bill is usually the wake-up call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zero observability
&lt;/h3&gt;

&lt;p&gt;There is no built-in logging, tracing, or audit trail. You cannot see what prompts were run, which subagents were spawned, how long requests took, or where errors occurred. For debugging, optimization, and compliance, this is a non-starter.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-provider fragility
&lt;/h3&gt;

&lt;p&gt;Claude Code ties you to one provider. If that provider hits rate limits, has an outage, or is unavailable in a region you need, your workflows stop. Switching providers requires manual reconfiguration.&lt;/p&gt;

&lt;h3&gt;
  
  
  No access control
&lt;/h3&gt;

&lt;p&gt;There is no centralized way to manage API keys, restrict access by role or team, enforce rate limits, or set budget caps. Every developer's setup is independent, which makes governance at scale nearly impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ungoverned MCP tool access
&lt;/h3&gt;

&lt;p&gt;Claude Code supports MCP for connecting to external tools like GitHub, Slack, databases, and internal APIs. When agents can take real actions through these tools, the question is not whether MCP works but whether it should be allowed, for which agent, with what permissions, and who is watching. Without a governance layer, MCP adoption stalls at experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding governance with an AI gateway
&lt;/h2&gt;

&lt;p&gt;An AI gateway sits between Claude Code and the model providers, adding observability, access control, and reliability without changing how developers use the tool. Portkey is built for exactly this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Team-level governance
&lt;/h3&gt;

&lt;p&gt;Isolate teams with separate workspaces, each with its own budget, rate limits, and access controls. Manage API keys centrally instead of distributing raw keys to individual developers. Enforce org-wide policies without building custom wrappers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-provider routing
&lt;/h3&gt;

&lt;p&gt;Route Claude Code requests across Anthropic, Bedrock, and Vertex AI through a single endpoint. Define fallback logic, add load balancing, or set metadata-based conditions for smarter routing. Developers change nothing in their workflow. The gateway handles it.&lt;/p&gt;

&lt;p&gt;Just add this to &lt;em&gt;settings.json&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://api.portkey.ai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_AUTH_TOKEN"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_PORTKEY_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_CUSTOM_HEADERS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"x-portkey-api-key: YOUR_PORTKEY_API_KEY&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;x-portkey-provider: @anthropic-prod"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost tracking and budget controls
&lt;/h3&gt;

&lt;p&gt;Every request is logged with token usage and cost, broken down by provider, team, user, and project. Set hard &lt;a href="https://portkey.ai/docs/product/enterprise-offering/budget-policies?ref=portkey.ai" rel="noopener noreferrer"&gt;budget limits&lt;/a&gt; per team or developer. Get alerts when spend crosses thresholds. Route to cheaper providers when cost matters more than latency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl90ib0wry3y4dyfrww7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhl90ib0wry3y4dyfrww7.png" alt="Claude Code agents: what they are, how they work, and how to scale them"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For teams running parallel agent fleets, this is the difference between controlled scaling and a surprise invoice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Full observability
&lt;/h3&gt;

&lt;p&gt;Portkey's OTEL-compliant &lt;a href="https://portkey.ai/features/observability?ref=portkey.ai" rel="noopener noreferrer"&gt;observability&lt;/a&gt; captures every request with metadata: latency, tokens, error rates, provider, route, and custom tags. Filter and search across all Claude Code usage in one dashboard. Debug failed agent loops, monitor MCP tool calls, and track performance trends over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dkz9gyb53vkumszplvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dkz9gyb53vkumszplvn.png" alt="Claude Code agents: what they are, how they work, and how to scale them"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Guardrails
&lt;/h3&gt;

&lt;p&gt;Enforce PII detection, content filtering, prompt injection protection, and token limits before requests reach the model. This matters especially for agentic workflows processing sensitive codebases or interacting with production systems through MCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Gateway
&lt;/h3&gt;

&lt;p&gt;Portkey's MCP Gateway provides centralized authentication, role-based access control, and full observability for MCP tool connections. Agents authenticate once through Portkey. The gateway handles credential injection, permission checks, and request logging for every configured MCP server.&lt;/p&gt;

&lt;p&gt;Platform teams control exactly which teams can access which MCP servers and tools. Every tool call is logged with full context: who accessed what, when, and with what parameters. This is the layer that makes MCP safe for production use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it together
&lt;/h3&gt;

&lt;p&gt;The shift happening right now is clear. Claude Code agents are moving from individual developer tools to team-level infrastructure. Subagents, agent teams, skills, MCP integrations, and orchestration layers are making agents dramatically more capable. But capability without governance is a liability.&lt;/p&gt;

&lt;p&gt;The teams scaling Claude Code successfully are treating it as an infrastructure problem: routing through a gateway, tracking every request, setting budget boundaries, and governing tool access centrally. The developers keep typing &lt;code&gt;claude&lt;/code&gt; in their terminal. The platform team gets the visibility and control they need.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;To run Claude Code agents with governance at scale,&lt;/em&gt; &lt;a href="https://app.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;em&gt;get started with Portkey&lt;/em&gt;&lt;/a&gt; &lt;em&gt;or&lt;/em&gt; &lt;a href="https://portkey.sh/blog?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;em&gt;book a demo&lt;/em&gt;&lt;/a&gt; &lt;em&gt;for a walkthrough.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>LLM Deployment Pipeline Explained Step by Step</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Fri, 27 Feb 2026 11:41:00 +0000</pubDate>
      <link>https://dev.to/portkey/llm-deployment-pipeline-explained-step-by-step-6g6</link>
      <guid>https://dev.to/portkey/llm-deployment-pipeline-explained-step-by-step-6g6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4buzfcfhmlp6clv89t0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4buzfcfhmlp6clv89t0.jpg" alt="LLM Deployment Pipeline Explained Step by Step" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLM deployment is the process of taking a trained language model and converting it into a production service that can handle live user requests reliably and at scale. &lt;/p&gt;

&lt;p&gt;In practice, &lt;strong&gt;a truly production-ready LLM system is shaped by five interconnected layers: containerization, infrastructure and GPU allocation, the API and serving layer, autoscaling, and monitoring&lt;/strong&gt;. These layers keep performance stable, costs predictable, and outputs trustworthy as real traffic flows in.&lt;/p&gt;

&lt;p&gt;💡 Gartner estimates that &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2023-10-11-gartner-says-more-than-80-percent-of-enterprises-will-have-used-generative-ai-apis-or-deployed-generative-ai-enabled-applications-by-2026?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;more than 80% of enterprises will have generative AI applications in production by 2026&lt;/u&gt;&lt;/a&gt;, yet most online tutorials still stop at “get it running” and ignore what happens after.&lt;/p&gt;

&lt;p&gt;Don’t worry, though, because this article covers the full lifecycle teams struggle with post-launch, including scaling predictably under real demand, monitoring probabilistic outputs that can fail silently, and controlling costs that compound with every request.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Cloud APIs, self-hosted, or on-premises&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Every LLM deployment begins with a foundational architectural decision: consume models via cloud APIs, run them yourself on cloud GPUs, or operate fully on-premises. The right choice depends on how you balance speed, control, cost, and compliance.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture path&lt;/th&gt;
&lt;th&gt;What it offers&lt;/th&gt;
&lt;th&gt;Tradeoffs&lt;/th&gt;
&lt;th&gt;Typical use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud APIs (&lt;a href="https://openai.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;, &lt;a href="https://www.anthropic.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-us?ref=portkey.ai" rel="noopener noreferrer"&gt;Microsoft Azure&lt;/a&gt;)&lt;/td&gt;
&lt;td&gt;Fastest route to production, no infrastructure management, elastic scaling&lt;/td&gt;
&lt;td&gt;Pay-per-token costs compound, vendor lock-in risk, data residency constraints&lt;/td&gt;
&lt;td&gt;Prototyping and early production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-hosted on cloud GPUs&lt;/td&gt;
&lt;td&gt;Full control over open models (&lt;a href="https://www.llama.com/?ref=portkey.ai" rel="noopener noreferrer"&gt;Llama&lt;/a&gt;, &lt;a href="https://mistral.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;Mistral&lt;/a&gt;), performance tuning&lt;/td&gt;
&lt;td&gt;You own reliability, scaling, and ops complexity&lt;/td&gt;
&lt;td&gt;Teams needing flexibility and customization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises&lt;/td&gt;
&lt;td&gt;Strong compliance posture (HIPAA, GDPR), predictable cost at high utilization&lt;/td&gt;
&lt;td&gt;Hardware expense, deep infrastructure expertise required&lt;/td&gt;
&lt;td&gt;Regulated or high-volume workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For self-hosting, infrastructure costs typically range from about &lt;a href="https://www.whitefiber.com/compare/best-gpus-for-llm-inference-in-2025?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;$0.75/hour&lt;/u&gt;&lt;/a&gt; for an L4 GPU up to &lt;a href="https://docs.jarvislabs.ai/blog/h100-price?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;$3.25/hour&lt;/u&gt;&lt;/a&gt; for an H100, before orchestration and redundancy. On-premises shifts that spend into upfront hardware investment that only pays off with sustained utilization.&lt;/p&gt;

&lt;p&gt;Across all three paths, data residency and security requirements often become the deciding factor.&lt;/p&gt;

&lt;p&gt;Also, keep in mind that according to &lt;a href="https://portkey.ai/llms-in-prod-25?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Portkey’s LLMs in Prod 2025&lt;/u&gt;&lt;/a&gt;, 40% of teams now use multiple LLM providers, up from 23% ten months earlier. As a result, you have to carefully consider how your architectural choice will affect portability and long-term leverage to avoid the risk of vendor lock-in. You can also use Portkey’s AI Gateway to mitigate that risk by routing across 1,600+ models through a single API, so switching providers becomes a configuration change rather than a rewrite.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;GPU selection and infrastructure sizing&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLMs only run fast when everything fits inside GPU memory (VRAM). That includes not just the model itself, but also the temporary memory it uses to hold conversation context while generating responses, called the &lt;a href="https://research.aimultiple.com/llm-quantization/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;KV cache&lt;/u&gt;&lt;/a&gt;. If either spills into normal system memory, performance drops sharply.&lt;/p&gt;

&lt;p&gt;This is why sizing GPUs by model size alone often fails in production. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A 70B model (FP16) needs about 140GB for weights. If you add a standard 8k context window and batch headroom, total VRAM &lt;a href="https://apxml.com/posts/ultimate-system-requirements-llama-3-models?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;reaches ~161GB&lt;/u&gt;&lt;/a&gt;, requiring two 80GB GPUs (like the H100) or a single H200.&lt;/li&gt;
&lt;li&gt;An 8B model (FP16) uses roughly 16GB for weights and &lt;a href="https://apxml.com/models/llama-3-1-8b?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;~18.4GB&lt;/u&gt;&lt;/a&gt; with context overhead – an ideal fit for one 24GB L4 GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, always budget VRAM for both the model and its live working memory.&lt;/p&gt;
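&lt;p&gt;The arithmetic is easy to reproduce. The sketch below uses illustrative shape parameters for a 70B-class model (80 layers, 8 KV heads, head dimension 128) and the FP16 assumption of 2 bytes per value; larger batches, activation overhead, and framework headroom push the real number higher.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough VRAM estimate: weights + KV cache, FP16 (2 bytes per value).
# Shape parameters below are illustrative for a 70B-class model; check your model card.
def estimate_vram_gb(params_billion, layers, kv_heads, head_dim, context_len, batch=1):
    bytes_per_value = 2                                   # FP16
    weights = params_billion * 1e9 * bytes_per_value      # ~140 GB for a 70B model
    # KV cache stores a key and a value per layer, per KV head, per token.
    kv_cache = 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_value
    return (weights + kv_cache) / 1e9

# 70B-class model with an 8k context at batch size 1: weights dominate,
# and batch headroom plus runtime overhead raise the total further in practice.
print(round(estimate_vram_gb(70, layers=80, kv_heads=8, head_dim=128, context_len=8192), 1))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;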

&lt;h2&gt;
  
  
  &lt;strong&gt;Inference frameworks and API endpoint design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The efficiency of an LLM deployment is defined by the request lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;During prefill, the full prompt is processed in parallel and is primarily compute-bound. &lt;/li&gt;
&lt;li&gt;During decode, tokens are generated one by one and become memory-bandwidth bound. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this, you’ll need an inference framework to manage the KV cache so the GPU never sits idle between these phases.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Framework&lt;/th&gt;
&lt;th&gt;Key innovation&lt;/th&gt;
&lt;th&gt;2026 performance edge&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://docs.vllm.ai/en/latest/?ref=portkey.ai" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;PagedAttention&lt;/td&gt;
&lt;td&gt;Industry standard; highest stability for general APIs.&lt;/td&gt;
&lt;td&gt;High-concurrency production APIs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.google.com/search?q=https%3A%2F%2Fgithub.com%2Fsglang-project%2Fsglang&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;SGLang&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;RadixAttention&lt;/td&gt;
&lt;td&gt;&lt;a href="https://research.aimultiple.com/inference-engines/?ref=portkey.ai" rel="noopener noreferrer"&gt;29% higher throughput&lt;/a&gt; than vLLM; instant prefix caching.&lt;/td&gt;
&lt;td&gt;Agentic and RAG workloads.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://nvidia.github.io/TensorRT-LLM/?ref=portkey.ai" rel="noopener noreferrer"&gt;TensorRT-LLM&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Compiled kernels&lt;/td&gt;
&lt;td&gt;Lowest latency (up to 2x faster TTFT on NVIDIA).&lt;/td&gt;
&lt;td&gt;Latency-critical systems.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/InternLM/lmdeploy?ref=portkey.ai" rel="noopener noreferrer"&gt;LMDeploy&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;TurboMind C++ engine&lt;/td&gt;
&lt;td&gt;&lt;a href="https://research.aimultiple.com/inference-engines/?ref=portkey.ai" rel="noopener noreferrer"&gt;Rivals SGLang&lt;/a&gt;; excellent 4-bit/8-bit throughput.&lt;/td&gt;
&lt;td&gt;High-volume batch inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The hierarchy of metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First is &lt;strong&gt;TTFT (Time to First Token)&lt;/strong&gt;, which is how quickly users see the first word appear – the &lt;strong&gt;most critical UX signal&lt;/strong&gt;. With superior cache reuse, &lt;a href="https://rawlinson.ca/articles/vllm-vs-sglang-performance-benchmark-h100?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;SGLang achieves roughly 3.7× faster TTFT at low concurrency&lt;/u&gt;&lt;/a&gt;, making it ideal for highly interactive agents.&lt;/p&gt;

&lt;p&gt;Next is &lt;strong&gt;TPS (Tokens Per Second)&lt;/strong&gt;, which defines total system capacity. While vLLM is commonly the default choice, &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1odp8pe/sglang_vs_vllm_on_h200_which_one_do_you_prefer/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;SGLang can deliver about 33% higher TPS&lt;/u&gt;&lt;/a&gt; in multi-turn conversations by avoiding repeated context processing.&lt;/p&gt;

&lt;p&gt;Then comes &lt;strong&gt;KV cache utilization&lt;/strong&gt;, which determines stability. Modern inference engines target around &lt;a href="https://research.aimultiple.com/inference-engines/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;80% GPU memory usage&lt;/u&gt;&lt;/a&gt; to balance throughput and safety, while pushing toward 95% can increase the risk of instability and trigger crashes during graph capture and runtime spikes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;API design for production UX&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt; is mandatory. Total generation time matters less than TTFT – users tolerate a 10-second response if the first token appears in ~200ms.&lt;/p&gt;
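&lt;p&gt;As a quick illustration, the sketch below measures TTFT against any OpenAI-compatible endpoint; a local vLLM server is assumed here, and the URL and model name are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
from openai import OpenAI

# Any OpenAI-compatible endpoint works; a local vLLM server is assumed here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="your-served-model",            # placeholder model name
    messages=[{"role": "user", "content": "Explain prefix caching in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta and first_token_at is None:
        first_token_at = time.perf_counter()
print("TTFT:", round(first_token_at - start, 3), "seconds")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;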

&lt;p&gt;Also, &lt;strong&gt;continuous batching&lt;/strong&gt; prevents queue bottlenecks. Your framework must allow new requests to join an active decode loop, rather than waiting in a strict first-in-first-out pipeline.&lt;/p&gt;

&lt;p&gt;For RAG systems, &lt;strong&gt;prefix caching&lt;/strong&gt; is critical. By caching the system prompt and retrieved documents, you can reduce prefill latency and cost by &lt;a href="https://blog.gopenai.com/sglang-vs-vllm-the-new-throughput-king-7daec596f7fa?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;up to 90% for returning users&lt;/u&gt;&lt;/a&gt;, dramatically improving both responsiveness and efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Scaling strategies that actually work for LLMs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Traditional autoscaling based on CPU or memory utilization fails for LLM workloads. Inference is primarily memory-bandwidth and I/O bound, so scaling decisions must reflect user-facing latency and GPU saturation, not generic resource metrics.&lt;/p&gt;

&lt;p&gt;For Kubernetes deployments, configure the Horizontal Pod Autoscaler (HPA) around three signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Queue depth (num_requests_waiting) – the most responsive signal:&lt;/strong&gt; Best practice today is to trigger scale-out when just 3–5 requests begin to queue, preventing latency spikes before users feel them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P90 TTFT (Time to First Token) – the SLA-level experience metric:&lt;/strong&gt; If the 90th percentile drifts beyond roughly 200–500ms, new replicas should spin up to preserve a fast, conversational feel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU KV cache utilization (gpu_cache_usage_perc) – shows physical saturation:&lt;/strong&gt; High cache usage combined with a growing queue means existing pods can’t accept more tokens without instability or crashes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, cold starts were once the biggest constraint, with model loads taking 30–120 seconds. However, modern streaming loaders have reset that baseline. Using &lt;a href="https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;NVIDIA’s Run:ai Model Streamer&lt;/u&gt;&lt;/a&gt;, a 7B model can stream from object storage into GPU memory in 5–15 seconds. When paired with frameworks like vLLM, total Time to Ready (container spin-up plus engine initialization) is now under 25 seconds, reducing the need for costly warm pools.&lt;/p&gt;

&lt;p&gt;In practice, Kubernetes users should avoid default CPU-based scaling rules. Instead, configure the HPA to scale using queue depth and TTFT, since those directly reflect user experience and system saturation.&lt;/p&gt;
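&lt;p&gt;The decision logic itself is easy to express. In the sketch below, the thresholds mirror the guidance above; the metric names (&lt;em&gt;num_requests_waiting&lt;/em&gt;, &lt;em&gt;gpu_cache_usage_perc&lt;/em&gt;) match what vLLM exposes via Prometheus, but how those values reach your autoscaler is deployment-specific.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scaling-decision sketch using the three signals discussed above.
# In production these values come from Prometheus (vLLM exposes
# num_requests_waiting and gpu_cache_usage_perc); here they are passed in directly.
def should_scale_out(queue_depth: int, p90_ttft_ms: float, kv_cache_usage: float) -&gt; bool:
    if queue_depth &gt;= 4:              # 3-5 waiting requests: scale before users feel latency
        return True
    if p90_ttft_ms &gt; 500:             # P90 TTFT drifting past the 200-500 ms band
        return True
    if kv_cache_usage &gt; 0.85 and queue_depth &gt; 0:   # physical saturation plus a growing queue
        return True
    return False

print(should_scale_out(queue_depth=5, p90_ttft_ms=340, kv_cache_usage=0.72))   # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;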

&lt;p&gt;Scaling at the infrastructure layer is only part of the solution. Above it, the &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Portkey AI Gateway&lt;/u&gt;&lt;/a&gt; adds production safeguards such as automatic retries, exponential backoff, circuit breakers, and provider failover. These mechanisms prevent temporary overloads or upstream failures from cascading into user-visible downtime.&lt;/p&gt;

&lt;p&gt;With the Gateway, &lt;a href="https://portkey.ai/case-studies/leading-delivery-platform?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;a leading online food delivery platform successfully handled a 3,100× traffic increase&lt;/u&gt;&lt;/a&gt;, peaking at 1,800 requests per second, while maintaining 99.99% uptime by combining HPA-based scaling with gateway-level resilience controls.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Monitoring LLMs in production&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLMs often fail silently, returning a successful status code (200 OK) while producing hallucinated or misleading outputs. Monitoring and observability are essential here: monitoring confirms the system is running, while observability confirms the response is actually correct. For this, production teams need a dual-layer telemetry approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Operational metrics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operational metrics (the heartbeat of your inference stack) ensure the fleet is responsive and stable under load. The most important are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;P95 TTFT (Time to First Token) –  the gold standard for user experience:&lt;/strong&gt; In 2026, enterprise targets have tightened to 200–500ms to preserve a natural, conversational feel. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TPOT (Time Per Output Token):&lt;/strong&gt; This measures streaming smoothness, with a goal of under 50ms so responses outpace human reading speed. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KV cache saturation – an early warning system:&lt;/strong&gt; Rising cache usage paired with growing queues signals an impending generation stall or crash long before errors appear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Semantic observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic observability evaluates whether answers are actually correct and safe. Using LLM-as-a-Judge techniques, teams can detect quality drift that traditional APM tools miss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faithfulness and grounding&lt;/strong&gt; checks verify that claims in responses are actually supported by the retrieved documents, forming the core guardrail for RAG systems. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://portkey.ai/blog/the-complete-guide-to-llm-observability/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Cost harvesting&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; detection flags abusive prompts designed to explode token usage and drain budgets – a growing 2026 threat. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tone and bias&lt;/strong&gt; drift monitoring tracks whether the model’s personality or alignment shifts subtly across millions of requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern platforms now unify these layers. For instance, Portkey provides observability directly within its AI Gateway, capturing over 40 data points per request across cost, performance, and accuracy – earning recognition as a &lt;a href="https://www.gartner.com/en/documents/7024598?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Gartner Cool Vendor in LLM Observability (2025)&lt;/u&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;👀 Typical production SLOs in 2026 include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99.99% availability.&lt;/li&gt;
&lt;li&gt;P95 TTFT of 200–500ms.&lt;/li&gt;
&lt;li&gt;P50 end-to-end RAG latency under 1.5 seconds.&lt;/li&gt;
&lt;li&gt;Hallucination rates below 1%, depending on domain risk tolerance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Controlling costs as you scale&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLM costs grow in a fundamentally different way than traditional infrastructure. Instead of paying mainly for reserved compute, spending increases with usage volume, since every request consumes tokens. What looks like a few cents per call can quickly become a major expense at millions of requests per month.&lt;/p&gt;

&lt;p&gt;To manage this, teams must start with cost attribution – tracking spend by feature, team, and user. Without this visibility, optimization happens too late, after budgets have already been exceeded.&lt;/p&gt;

&lt;p&gt;Once costs are visible, three levers drive meaningful savings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic caching&lt;/strong&gt; avoids recomputing responses for repeated or similar queries (a toy sketch follows this list). For instance, &lt;a href="https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/" rel="noopener noreferrer"&gt;&lt;u&gt;semantic cache&lt;/u&gt;&lt;/a&gt; delivers around 20% average cache hit rates in early production and up to 60% in focused RAG systems, with roughly 20× faster responses at near-zero cost on cached requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent routing&lt;/strong&gt; sends simpler requests to cheaper, faster models while reserving premium models for complex reasoning. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch processing&lt;/strong&gt;, such as &lt;a href="https://developers.openai.com/api/docs/guides/batch?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;OpenAI’s Batch API&lt;/u&gt;&lt;/a&gt;, reduces costs for large, non-urgent workloads by grouping requests into lower-priced bulk jobs with flexible completion windows.&lt;/li&gt;
&lt;/ul&gt;
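&lt;p&gt;The idea behind semantic caching can be sketched with a toy embedding and cosine similarity, as below. The &lt;code&gt;embed&lt;/code&gt; function and the 0.9 threshold are purely illustrative; real systems use a proper embedding model and a vector store.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

# Toy semantic cache: reuse a cached answer when a new query is "close enough"
# to a previous one. embed() is a stand-in for a real embedding model.
def embed(text: str) -&gt; list[float]:
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch) / 1000.0
    return vec

def cosine(a: list[float], b: list[float]) -&gt; float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

cache: list[tuple[list[float], str]] = []

def lookup(query: str, threshold: float = 0.9):
    q = embed(query)
    for vec, answer in cache:
        if cosine(q, vec) &gt;= threshold:
            return answer          # cache hit: no model call, near-zero marginal cost
    return None

def store(query: str, answer: str) -&gt; None:
    cache.append((embed(query), answer))

store("What is your refund policy?", "Refunds are issued within 14 days.")
print(lookup("what is the refund policy"))   # a real embedding model makes this hit reliable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;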

&lt;p&gt;If you’re looking for a way to maintain control over costs, opt for Portkey’s AI Gateway. One delivery platform &lt;a href="https://portkey.ai/case-studies/leading-delivery-platform?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;saved $500,000 through optimized routing and caching via Portkey&lt;/u&gt;&lt;/a&gt; while maintaining 99.99% uptime across billions of requests.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Building your deployment pipeline&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As you’ve seen, LLM deployment is an operational pipeline made up of infrastructure, serving, scaling, monitoring, and cost control, with each layer compounding the reliability and efficiency of the next.&lt;/p&gt;

&lt;p&gt;If your models are already running but the operational layer is holding you back, this is where an AI gateway like Portkey fits. &lt;/p&gt;

&lt;p&gt;Portkey sits above your inference stack, providing intelligent routing across providers, built-in observability, semantic caching, cost controls, and automatic failover, so you can scale reliably without building custom infrastructure.&lt;/p&gt;

&lt;p&gt;Ready to move from “model deployed” to truly production-ready? &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;u&gt;Try Portkey’s AI gateway&lt;/u&gt;&lt;/a&gt; now and turn your LLMs into reliable, cost-efficient production systems!&lt;/p&gt;


&lt;h2&gt;
  
  
  Ship Faster with AI
&lt;/h2&gt;

&lt;p&gt;Everything you need to build, deploy, and scale AI applications.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>The best approach to compare LLM outputs</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Tue, 24 Feb 2026 13:45:39 +0000</pubDate>
      <link>https://dev.to/portkey/the-best-approach-to-compare-llm-outputs-4am4</link>
      <guid>https://dev.to/portkey/the-best-approach-to-compare-llm-outputs-4am4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3arsssydu4vizk4101oh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3arsssydu4vizk4101oh.png" alt="The best approach to compare LLM outputs" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once LLMs are in production, output quality stops being a subjective question and becomes an operational one. Teams are no longer asking &lt;em&gt;whether&lt;/em&gt; they need to evaluate outputs, but &lt;em&gt;how&lt;/em&gt; to do it reliably as systems evolve.&lt;/p&gt;

&lt;p&gt;Production systems can change frequently. Prompts are iterated on, models are swapped, routing logic is adjusted, and traffic patterns shift. In this environment, point-in-time judgments of quality are not useful. What matters is whether output quality is stable, improving, or degrading across these changes.&lt;/p&gt;

&lt;p&gt;A repeatable approach to measurable output quality gives teams a way to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Establish baselines that survive prompt and model updates&lt;/li&gt;
&lt;li&gt;Compare outputs across versions, providers, and configurations&lt;/li&gt;
&lt;li&gt;Detect regressions before they reach users&lt;/li&gt;
&lt;li&gt;Reason about trade-offs between quality, latency, and cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is manual review and one-off prompting enough to compare LLM outputs?
&lt;/h2&gt;

&lt;p&gt;Manual review and ad-hoc prompting still play a role in mature LLM systems, but they stop being sufficient once scale and change are introduced.&lt;/p&gt;

&lt;p&gt;Spot-checking outputs is inherently narrow. Reviewers see a small slice of traffic, often curated or synthetic, and their judgments are shaped by context that may not exist in real usage. As prompts and models evolve, those reviews quickly become outdated. What was “approved” last week may no longer reflect what users are seeing today.&lt;/p&gt;

&lt;p&gt;One-off prompting has similar limitations. Testing a handful of inputs against a new model or prompt can reveal obvious failures, but it does not capture variance. Non-determinism means two runs of the same prompt can produce meaningfully different outputs. Without repetition and aggregation, it’s impossible to tell whether a change actually improved quality or simply produced a better-looking example.&lt;/p&gt;

&lt;p&gt;At scale, manual review also introduces inconsistency. Different reviewers apply different standards, and those standards shift over time. There is no durable baseline, no reliable way to compare runs across days or deployments, and no systematic way to surface regressions early.&lt;/p&gt;

&lt;p&gt;Manual review remains useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calibrating evaluation criteria&lt;/li&gt;
&lt;li&gt;Investigating edge cases&lt;/li&gt;
&lt;li&gt;Providing high-quality labels for difficult tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But as the primary mechanism for comparing LLM outputs, it breaks down. Production systems need methods that are repeatable, aggregatable, and resilient to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics to consider when comparing LLM outputs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The core dimensions:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Effective LLM evaluation requires metrics that map to the actual risks and goals of your application. Most production systems care about three primary dimensions:   &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;whether the output is factually grounded (hallucination detection),&lt;/li&gt;
&lt;li&gt;whether it addresses what the user actually asked (relevance and correctness), and&lt;/li&gt;
&lt;li&gt;whether it meets safety and quality standards (toxicity, coherence, formatting).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These dimensions are not interchangeable. A response can be highly relevant but factually wrong, or factually correct but unhelpful. Choosing which metrics to track depends on what failure modes matter most for your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic vs. LLM-based metrics:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Metrics fall into two categories: &lt;em&gt;deterministic and model-based&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Deterministic metrics like regex matching, JSON validation, and keyword presence are fast, cheap, and predictable. They work well for structured outputs or hard constraints.&lt;/p&gt;

&lt;p&gt;Model-based metrics use an LLM to judge another LLM's output, which is necessary for evaluating qualities like coherence, completeness, or whether an answer is actually helpful. LLM-as-a-judge approaches are more expensive and introduce their own variance, but they scale where human review cannot. In practice, most teams use both: deterministic checks for objective criteria, model-based evaluation for subjective quality.&lt;/p&gt;
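&lt;p&gt;On the deterministic side, checks can be as small as the sketch below; the required keys and the order-ID format are illustrative. The model-based side would replace these functions with a judge prompt sent to another LLM.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import re

# Deterministic output checks: fast, cheap, and predictable.
def is_valid_json_with_keys(output: str, required_keys: list[str]) -&gt; bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(key in data for key in required_keys)

def mentions_order_id(output: str) -&gt; bool:
    # Illustrative format: ORD- followed by six digits.
    return re.search(r"ORD-\d{6}", output) is not None

output = '{"status": "refunded", "order_id": "ORD-104233"}'
print(is_valid_json_with_keys(output, ["status", "order_id"]))    # True
print(mentions_order_id(output))                                  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;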

&lt;p&gt;&lt;a href="https://arize.com/llm-as-a-judge/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvi5j9i7qtquyt434ioyy.png" alt="The best approach to compare LLM outputs" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing the right granularity:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Metrics also differ in scope.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Span-level evaluation scores individual LLM calls, useful for isolating which step in a pipeline is underperforming. &lt;/li&gt;
&lt;li&gt;Trace-level evaluation looks at the full chain of reasoning, which matters when correctness depends on multiple steps working together. &lt;/li&gt;
&lt;li&gt;Session-level evaluation captures user experience across a conversation, surfacing issues like frustration or confusion that only emerge over time. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Comparing outputs effectively means instrumenting at the right level and aggregating results in ways that surface actionable patterns, not just individual failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Arize approaches evals
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The evaluation loop:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Arize treats evaluation as an operational loop, not a one-time gate. Evaluations attach directly to traces, so every LLM call can be scored and explained in context. This means teams can run evaluations offline during development, testing prompt changes against datasets before deployment, and then transition the same evaluators to run online against production traffic. The result is a continuous feedback mechanism: evaluations surface regressions, traces provide the context to diagnose them, and experiments validate fixes before they ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-built and custom evaluators:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The platform includes pre-built &lt;a href="https://portkey.ai/docs/guides/prompts/llm-as-a-judge?ref=portkey.ai#building-an-llm-as-a-judge-system-for-ai-customer-support-agent" rel="noopener noreferrer"&gt;LLM-as-a-judge&lt;/a&gt; evaluators for common concerns: hallucination, relevance, toxicity, summarization quality, code correctness, and user frustration, among others. These templates are benchmarked against golden datasets with known precision and recall, so teams can deploy them with confidence. For domain-specific needs, custom evaluators can be defined in plain language, describing what "good" looks like in the same way you would brief a human reviewer, or implemented as deterministic code for objective criteria. Evaluators are versioned and reusable across projects, which keeps evaluation criteria consistent as systems evolve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainability and action:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every evaluation produces not just a label or score, but an explanation. This matters because knowing that an output was marked "hallucinated" is less useful than understanding why: which claim was unsupported, which context was missing. Explanations make evaluations actionable: they point teams toward specific retrieval failures, prompt gaps, or model limitations. Combined with trace-level observability, this turns evaluation from a reporting mechanism into a diagnostic tool that accelerates iteration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98ck2eve4ts41xndbj0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F98ck2eve4ts41xndbj0t.png" alt="The best approach to compare LLM outputs" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Completing the loop with Portkey: routing and orchestration for evaluations
&lt;/h2&gt;

&lt;p&gt;Running meaningful &lt;a href="https://portkey.ai/docs/guides/use-cases/run-batch-evals?ref=portkey.ai#how-to-run-structured-output-evals-at-scale" rel="noopener noreferrer"&gt;LLM evaluations&lt;/a&gt; requires more than scoring outputs in isolation. Teams need a reliable way to orchestrate how different models, prompts, and configurations are exercised so that comparisons are intentional and repeatable.&lt;/p&gt;

&lt;p&gt;By acting as a routing layer for LLM traffic, Portkey's &lt;a href="https://portkey.ai/features/ai-gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Gateway&lt;/a&gt; makes this orchestration straightforward. Multiple models and providers can be invoked through a single, consistent API, allowing teams to evaluate alternatives side by side without changing application code. The same inputs can be routed to different models, prompt versions, or configurations as part of structured evaluation runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mkker32fip2ch1qyl2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mkker32fip2ch1qyl2y.png" alt="The best approach to compare LLM outputs" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This becomes especially useful when evaluations move closer to production. Instead of maintaining separate pipelines for testing and live traffic, teams can reuse the same routing logic to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare models under identical conditions&lt;/li&gt;
&lt;li&gt;Test prompt changes against real inputs&lt;/li&gt;
&lt;li&gt;Run controlled experiments alongside production traffic&lt;/li&gt;
&lt;/ul&gt;
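
&lt;p&gt;As a rough sketch of the first of those, an evaluation run that exercises two candidate models under identical conditions through the gateway might look like this; the model identifiers and inputs are placeholders, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from portkey_ai import Portkey

portkey = Portkey(api_key="PORTKEY_API_KEY")

# Placeholder inputs; in practice these come from a curated evaluation dataset.
eval_inputs = ["Summarize our refund policy.", "Draft a status update for the outage."]
candidates = ["@openai-provider/gpt-4o", "@anthropic-provider/claude-sonnet-4-5-20250514"]

results = []
for model in candidates:
    for prompt in eval_inputs:
        response = portkey.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Stored alongside the model so evaluators can score and compare runs.
        results.append({
            "model": model,
            "input": prompt,
            "output": response.choices[0].message.content,
        })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because every call goes through the same routing layer, the only variable between runs is the model itself, which keeps comparisons honest.&lt;/p&gt;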

&lt;p&gt;Observability ensures that these evaluations remain interpretable. Every request captures the context needed to explain outcomes: which model was used, how it was routed, which prompt version was applied, and how the system performed in terms of latency and cost.&lt;/p&gt;

&lt;p&gt;Together, routing and observability turn evaluations into an operational loop. Results from Arize evaluations can be traced back to the exact model and routing decisions that produced them, making it easier to act on those insights and iterate with confidence.&lt;/p&gt;

&lt;p&gt;To try it out, &lt;a href="https://app.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;get started with Portkey&lt;/a&gt; for free (Portkey is open source!).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://portkey.ai/book-a-demo?utm_medium=blog&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Book a demo&lt;/a&gt; with our experts if you're looking to deploy AI Agents in production!&lt;/p&gt;

</description>
      <category>aievals</category>
      <category>observability</category>
    </item>
    <item>
      <title>Open AI Responses API vs. Chat Completions vs. Anthropic Messages API</title>
      <dc:creator>Drishti Shah</dc:creator>
      <pubDate>Mon, 23 Feb 2026 14:27:32 +0000</pubDate>
      <link>https://dev.to/portkey/open-ai-responses-api-vs-chat-completions-vs-anthropic-messages-api-33h2</link>
      <guid>https://dev.to/portkey/open-ai-responses-api-vs-chat-completions-vs-anthropic-messages-api-33h2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuoszatlu11gfgohkqp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsuoszatlu11gfgohkqp3.png" alt="Open AI Responses API vs. Chat Completions vs. Anthropic Messages API"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The LLM API landscape has never been more fragmented, or more consequential. As teams move from prototypes to production, the choice of which API format to build on shapes your vendor flexibility, your codebase complexity, and how quickly you can swap models when something better comes along.&lt;/p&gt;

&lt;p&gt;Today, three API formats dominate how AI Agents talk to LLMs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI's Chat Completions API&lt;/strong&gt; — the de facto standard, universally supported&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI's Responses API&lt;/strong&gt; — the newer, agent-oriented evolution with built-in tools and state management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic's Messages API&lt;/strong&gt; — Claude's native interface, with capabilities like extended thinking and prompt caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each was designed with different goals in mind, and those differences shape how you build, how you scale, and how locked in you are to a single provider.&lt;/p&gt;

&lt;p&gt;Portkey supports all three natively, and that's where standardization starts to matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open AI Responses API vs. Chat Completions vs. Messages API: At a glance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Chat Completions&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Responses API&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Messages API&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provider&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/chat/completions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/responses&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /v1/messages&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Design goal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stateless text generation&lt;/td&gt;
&lt;td&gt;Agentic workflows with built-in tools&lt;/td&gt;
&lt;td&gt;Claude-native capabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Optional server-side (with &lt;code&gt;store: true&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool / function calling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (with built-in tools)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Built-in web search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ (via server tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extended thinking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ (Claude only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅ (with &lt;code&gt;cache_control&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Computer use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ecosystem compatibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Widest&lt;/td&gt;
&lt;td&gt;Growing&lt;/td&gt;
&lt;td&gt;Claude-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What makes each endpoint different
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chat Completions: the universal standard
&lt;/h3&gt;

&lt;p&gt;Chat Completions (&lt;code&gt;POST /v1/chat/completions&lt;/code&gt;) is where everything started. You send an array of messages, each with a role (&lt;code&gt;system&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;assistant&lt;/code&gt;, or &lt;code&gt;tool&lt;/code&gt; for tool call results) and the model replies. It's stateless by design, so you own the conversation history and pass it with every request.&lt;/p&gt;

&lt;p&gt;This simplicity is its biggest strength. Because the model has no memory between calls, you have full control over what context it sees. And because practically every major provider has adopted this format, code written against Chat Completions works across OpenAI, Anthropic (via adapters), Gemini, Mistral, Bedrock, and other models with minimal changes.&lt;/p&gt;
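
&lt;p&gt;Because the API is stateless, a multi-turn conversation is just the same request with a longer &lt;code&gt;messages&lt;/code&gt; array. A minimal sketch with the OpenAI SDK (the follow-up prompt is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# You own the history: append each reply before sending the next turn.
history = [{"role": "user", "content": "Explain quantum computing in simple terms"}]

first = client.chat.completions.create(model="gpt-4o", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

history.append({"role": "user", "content": "Now explain it to a five-year-old"})
second = client.chat.completions.create(model="gpt-4o", messages=history)
print(second.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;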

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Widest ecosystem of tools, frameworks, and libraries&lt;/li&gt;
&lt;li&gt;Predictable, well-understood response format&lt;/li&gt;
&lt;li&gt;Easiest path to switching providers or running multi-provider setups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No built-in tools: web search, code execution, and file search all need external orchestration&lt;/li&gt;
&lt;li&gt;No native support for extended reasoning or prompt caching&lt;/li&gt;
&lt;li&gt;No server-side state: you manage conversation history entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The response object returns &lt;code&gt;choices&lt;/code&gt;, where each choice contains a &lt;code&gt;message&lt;/code&gt; with &lt;code&gt;role: "assistant"&lt;/code&gt; and &lt;code&gt;content&lt;/code&gt;. Tool calls come back in &lt;code&gt;tool_calls&lt;/code&gt;. Clean and predictable, which is why it became the lingua franca of LLM APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When your use case is primarily text generation: chatbots, summarization, classification, content generation, Q&amp;amp;A. It's the right default if you're using frameworks like LangChain or LlamaIndex that abstract over providers, or if cross-provider portability matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Responses API: built for agents
&lt;/h3&gt;

&lt;p&gt;The Responses API (&lt;code&gt;POST /v1/responses&lt;/code&gt;) takes a different approach. It's designed to run agentic loops: the model can call multiple built-in tools (web search, file search, code interpreter, computer use, remote MCP servers) within a single API request, without you orchestrating each step.&lt;/p&gt;

&lt;p&gt;State management comes in two forms. With &lt;code&gt;previous_response_id&lt;/code&gt;, you chain responses by referencing a prior response ID and the model picks up context without you resending the full history, but you're still tracking the ID yourself. The newer Conversations API goes further, maintaining a durable conversation object server-side that automatically accumulates turns across sessions.&lt;/p&gt;
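
&lt;p&gt;A sketch of the first form, chaining turns with &lt;code&gt;previous_response_id&lt;/code&gt; through the OpenAI SDK (the prompts are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o",
    input="Explain quantum computing in simple terms",
)

# The follow-up references the prior response instead of resending the history.
followup = client.responses.create(
    model="gpt-4o",
    previous_response_id=first.id,
    input="Now give a one-sentence analogy",
)
print(followup.output_text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;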

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in tools that run within a single request, no external orchestration needed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;previous_response_id&lt;/code&gt; chains turns without resending prior tokens&lt;/li&gt;
&lt;li&gt;Designed for multi-step agentic workflows where context and tool results accumulate&lt;/li&gt;
&lt;li&gt;Better cache utilization compared to Chat Completions for repeated context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What it doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Natively available on OpenAI models only (though Portkey makes it work across providers)&lt;/li&gt;
&lt;li&gt;More complex response structure: &lt;code&gt;output&lt;/code&gt; is an array of typed items rather than a single message&lt;/li&gt;
&lt;li&gt;Overkill for simple single-turn completions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When you're building autonomous agents that use built-in tools, or multi-turn workflows where you want to reduce token overhead across turns. It's the right choice when agentic behavior and tool use are core to your application.&lt;/p&gt;

&lt;h3&gt;
  
  
  Messages API: Claude's native interface
&lt;/h3&gt;

&lt;p&gt;Anthropic's Messages API (&lt;code&gt;POST /v1/messages&lt;/code&gt;) is designed around how Claude works. While it shares surface similarities with Chat Completions, it exposes capabilities that are specific to Claude and don't exist in OpenAI's formats.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extended thinking:&lt;/strong&gt; Claude returns &lt;code&gt;type: "thinking"&lt;/code&gt; content blocks before the final answer, exposing its reasoning process. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt caching:&lt;/strong&gt; Fine-grained &lt;code&gt;cache_control&lt;/code&gt; lets you cache specific content blocks (with 5-minute or 1-hour TTLs), reducing latency and cost significantly for repeated context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich content blocks:&lt;/strong&gt; The &lt;code&gt;content&lt;/code&gt; array supports text, images, PDFs, tool use, thinking blocks, and citations pointing to source documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop reason granularity:&lt;/strong&gt; &lt;code&gt;stop_reason&lt;/code&gt; can be &lt;code&gt;end_turn&lt;/code&gt;, &lt;code&gt;max_tokens&lt;/code&gt;, &lt;code&gt;stop_sequence&lt;/code&gt;, &lt;code&gt;tool_use&lt;/code&gt;, &lt;code&gt;pause_turn&lt;/code&gt;, or &lt;code&gt;refusal&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native web search:&lt;/strong&gt; Pass &lt;code&gt;{"type": "web_search_20250305", "name": "web_search"}&lt;/code&gt; in the &lt;code&gt;tools&lt;/code&gt; array and Claude handles execution server-side, returning results in a &lt;code&gt;server_tool_use&lt;/code&gt; response block&lt;/li&gt;
&lt;/ul&gt;
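
&lt;p&gt;For instance, a request that caches a large system prompt and enables server-side web search might look roughly like this; it is a sketch against Anthropic's documented &lt;code&gt;cache_control&lt;/code&gt; and web search tool formats, and the document text is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic(api_key="ANTHROPIC_API_KEY")

message = client.messages.create(
    model="claude-sonnet-4-5-20250514",
    max_tokens=1024,
    # Cache the large, repeated context block across requests.
    system=[{
        "type": "text",
        "text": "LONG_POLICY_DOCUMENT_TEXT",
        "cache_control": {"type": "ephemeral"},
    }],
    # Server-side web search, executed by Anthropic during the turn.
    tools=[{"type": "web_search_20250305", "name": "web_search"}],
    messages=[{"role": "user", "content": "What changed in our refund policy this year?"}],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;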

&lt;p&gt;&lt;strong&gt;What it doesn't do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No server-side state management (you manage history yourself)&lt;/li&gt;
&lt;li&gt;Not natively compatible with non-Anthropic providers without a translation layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The response object returns a &lt;code&gt;content&lt;/code&gt; array of typed blocks. A single response might include a &lt;code&gt;thinking&lt;/code&gt; block, a &lt;code&gt;text&lt;/code&gt; block, and a &lt;code&gt;tool_use&lt;/code&gt; block in sequence. Citations on text blocks tell you exactly which document or character range the model drew from.&lt;/p&gt;
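
&lt;p&gt;Consuming that shape is mostly a matter of branching on block type. Continuing from a &lt;code&gt;message&lt;/code&gt; returned by the earlier request (attribute names follow Anthropic's Python SDK):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# 'message' is a Messages API response object from the earlier request.
for block in message.content:
    if block.type == "thinking":
        print("Reasoning:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
    elif block.type == "tool_use":
        print("Tool call:", block.name, block.input)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;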

&lt;p&gt;&lt;strong&gt;When to use it:&lt;/strong&gt; When you're building specifically on Claude and need extended thinking for complex reasoning, prompt caching for document-heavy workloads, or reasoning transparency in your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Portkey supports all three
&lt;/h2&gt;

&lt;p&gt;Most teams end up needing more than one of these formats. You might use Chat Completions for a general assistant, the Responses API for an autonomous agent, and Claude's Messages API for a document reasoning pipeline, all within the same &lt;a href="https://portkey.ai/features/agents?ref=portkey.ai" rel="noopener noreferrer"&gt;AI Agent&lt;/a&gt; stack.&lt;/p&gt;

&lt;p&gt;Building direct integrations with each provider means separate SDKs, separate observability, and code that breaks every time you want to try a new model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Portkey sits between your application and every provider, handling the translation so you don't have to.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The best part: you can &lt;strong&gt;use any of the three API formats with any provider and model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Want to use the Messages API format but route to a Gemini model? Portkey handles the transformation. Want Chat Completions format but call a Claude model? Same thing.&lt;/p&gt;

&lt;p&gt;Beyond format flexibility, Portkey adds what direct API access can't give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Every request logged, traced, and searchable across all providers and API formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallbacks and load balancing:&lt;/strong&gt; Route to a backup provider if your primary is down or rate-limited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt management:&lt;/strong&gt; Version, test, and deploy prompts centrally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking:&lt;/strong&gt; Unified spend view across providers and models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance:&lt;/strong&gt; Enterprise controls over which teams access which models&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Making calls with each API through Portkey
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Chat Completions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portkey_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Portkey&lt;/span&gt;

&lt;span class="n"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@openai-provider/gpt-5.2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Responses API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portkey_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Portkey&lt;/span&gt;

&lt;span class="n"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@openai-provider/gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Messages API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Anthropic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.portkey.ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@anthropic-provider/claude-sonnet-4-5-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Switching providers without changing your code
&lt;/h3&gt;

&lt;p&gt;The real payoff is when you want to swap providers.&lt;/p&gt;

&lt;p&gt;To move from OpenAI to Claude on the Responses API, all you need to do is change the provider name and model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;portkey_ai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Portkey&lt;/span&gt;

&lt;span class="n"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PORTKEY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@anthropic-provider/claude-sonnet-4-5-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your application code, your observability, your fallback logic, none of it changes. The format stays the same and Portkey's &lt;a href="https://portkey.ai/docs/product/ai-gateway/universal-api?ref=portkey.ai" rel="noopener noreferrer"&gt;universal API&lt;/a&gt; handles the translation to whichever provider you're routing to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;All three API formats work through Portkey with a single configuration change, pointing your SDK's base URL at Portkey's gateway. From there, routing, translation, observability, and reliability are handled for you.&lt;/p&gt;
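
&lt;p&gt;For example, an existing OpenAI SDK integration can be redirected with two fields; treat this as a sketch, and confirm the exact gateway base URL and auth setup in Portkey's docs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Point the stock OpenAI SDK at Portkey's gateway and use the
# @provider/model convention from the examples above.
client = OpenAI(
    api_key="PORTKEY_API_KEY",
    base_url="https://api.portkey.ai/v1",
)

response = client.chat.completions.create(
    model="@openai-provider/gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms"}],
)
print(response.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;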

&lt;p&gt;&lt;a href="https://portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Get started with Portkey&lt;/strong&gt;&lt;/a&gt; | &lt;a href="https://docs.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Read the docs&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To explore how Portkey can support your AI strategy, &lt;a href="https://calendly.com/portkey-ai/quick-meeting?ref=portkey.ai" rel="noopener noreferrer"&gt;book a demo&lt;/a&gt; here.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
