Morgan Willis

Posted on Jun 5 • Edited on Jun 9

Inference Theft: Your AI Endpoint Is Someone Else's Free Model

#aws #ai #security #architecture

Earlier this year, a bunch of people figured out they could use a customer service chatbot for a popular fast-food chain as a free coding assistant. It went viral. Some customers came looking for burritos and others left with LeetCode solutions. Everyone got what they wanted except the company paying for the inference.

The chatbot was backed by a capable general-purpose model with no way to enforce what it should and shouldn't answer. If you asked it to invent a novel approach to bubble sort, it would try. The model didn't know it was only supposed to be a burrito bot; it just saw a prompt and responded.

If your AI endpoint doesn't restrict who can send it requests and doesn't limit what it will and won't answer, any general-purpose model you expose becomes a general-purpose model for everyone, on your dime.

This is how you become the victim of inference theft.

Inference theft occurs when someone repurposes your AI application as a model endpoint that you never intended to expose. They route requests through your application and let you pay the inference bill. Inference theft is one of the fastest ways to create a denial-of-wallet event.

In a denial-of-wallet attack, the goal isn't to take your application offline. The goal is to make you absorb enough traffic that cost becomes the attack. With AI systems, inference theft is a two-for-one for the attacker: they run up your inference bill and monetize it at the same time.

I was reading a recent article from Vercel on inference theft and their framing lays out the economics: AI requests against frontier models can be a million times more expensive than standard HTTP requests. A traditional API call costs a fraction of a cent, but a single expensive agentic request can cost a dollar or more. That cost asymmetry makes AI endpoints one of the highest-margin targets an attacker can find.

Their recommendation is per-request bot verification, which is a good start. But I think it's one layer in what needs to be a multi-layer approach.

How Does Inference Theft Work?

One form of inference theft looks like this:

An attacker finds an AI endpoint exposed to the internet (your chatbot, your docs assistant, your customer service agent)
They write a thin adapter that wraps your endpoint in a standard model API interface
They sell access to that adapter at 5-10% of what the model provider charges
Their customers send prompts through the adapter, which routes them to your endpoint
You pay the full inference cost for every request

This is exactly what happened to the burrito bot. Someone reverse-engineered its backend, wrapped it in a standard model API, and suddenly anyone could point their coding tools at a fast-food company's inference bill.

The attacker's cost to pull this off is a one-time engineering effort to build the adapter. Your cost is ongoing inference. If you've built web applications before, you probably have a mental playbook for abuse prevention.

You might think standard rate limiting should be able to shed some of the load and reduce the risk. But IP-based rate limiting fails when attackers use residential proxies to spread the load across IPs that all look legitimate.

Then, what about authentication? Make sure the endpoint can only be called with valid tokens. But even if you have authentication in place, attackers can create or use thousands of accounts. And even a single legitimate account can run up serious costs if there's nothing limiting how many expensive tasks it can submit. Authentication tells you who is making the request, but it doesn't tell you how much that person should be allowed to spend.

You might also consider using CAPTCHAs, but CAPTCHAs only help if the problem is bots. Even then, an attacker only needs to solve the CAPTCHA once to unlock thousands of requests behind it. And CAPTCHAs do nothing at all when the abuse comes from real humans doing things your endpoint wasn't designed for, like asking your burrito bot to write Python.

So what do you actually do?

Bot Detection Is Necessary But Not Sufficient

Per-request bot verification is a real defense. For anonymous web-facing endpoints, behavioral analysis can identify large amounts of automated traffic effectively. But it only answers one question: is this request coming from a human?

It doesn't answer: is this request appropriate for an expensive model? Is this user within their budget? Should this request reach a model at all?

Inference theft has multiple shapes, and you need to think about defense at multiple points in your architecture.

Multiple Layers, Not One Gate

In practice, I think about this as a series of questions that get asked before the expensive thing happens:

Is this a human? → Bot verification
Is this allowed content? → Guardrails enforcement
Does this need a big model? → Cost-aware routing
Is this user within budget? → Budget controls
                          ↓
                        Model

Each layer answers a different question, and protects your system from a different angle.
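As a rough sketch of that ordering in Python: the checks run cheapest-first, and the model is only called if every one of them passes. The `Request` type, the two placeholder checks, and `call_model` are all hypothetical names I made up for illustration, not a real gateway API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Request:
    user_id: str
    prompt: str


# A check returns None to let the request continue, or a reason string to stop it.
Check = Callable[[Request], Optional[str]]


def run_pipeline(request: Request, checks: Sequence[Check],
                 call_model: Callable[[Request], str]) -> str:
    """Run the cheap checks in order; call the model only if every check passes."""
    for check in checks:
        reason = check(request)
        if reason is not None:
            return f"rejected: {reason}"
    return call_model(request)


# Hypothetical placeholder checks. Real ones would call your bot-verification
# service, guardrails, and budget store.
def has_identity(request: Request) -> Optional[str]:
    return None if request.user_id else "unverified client"


def on_topic(request: Request) -> Optional[str]:
    return None if "burrito" in request.prompt.lower() else "off-topic"
```

The useful property is that a rejected request never touches `call_model`, so the cost of turning it away is whatever the cheapest failing check costs.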

Not every system needs all of these. If you're running an internal tool behind a VPN with trusted users, bot verification might not matter to you. The right combination depends on your threat model, your cost tolerance, and how your endpoint is exposed to the world.

That said, the principle holds: the expensive model should be the last thing in the chain, not the first thing that evaluates whether a request is legitimate.

Use a Proper Front Door

Traditional applications don't expose databases directly to the internet. Requests flow through API gateways, load balancers, authentication layers, and policy engines before they ever reach a backend service.

AI systems need the same pattern.

An API Gateway or AI gateway acts as the front door for your models and agents. Instead of allowing users to interact directly with a model endpoint, requests pass through a centralized layer that can enforce authentication, route traffic, track usage, and make cost-based decisions before inference occurs.

This is the same architectural evolution we went through with APIs. We learned that pushing security, governance, and traffic management into every individual service creates inconsistency and operational overhead. API gateways centralized those concerns. AI gateways are emerging to do the same thing for model traffic.

Whether you build this layer yourself or adopt a managed solution, the principle is the same: users shouldn't be talking directly to expensive models. They should be talking to a system that decides whether a model invocation should happen at all.

Prompts Are Suggestions, Guardrails Are Architecture

Even when requests make it past the first few lines of defense, you should still have other layers to protect against abuse. This is because models don't really have boundaries.

Without guardrails in place, they'll answer whatever you ask them. That's what makes them useful, and it's also what makes them expensive to leave unsupervised or unsecured.

If your customer service bot will happily reverse a linked list, it doesn't have boundaries. And the instinct is to prompt engineer harder. But prompts are suggestions, not real enforcement, and a determined user can get around prompt-level instructions.

The bare minimum is to enforce boundaries at the model layer itself with input and output guardrails for content filtering. On AWS, Amazon Bedrock Guardrails lets you define these policies as configuration rather than code, and other providers offer something similar.

Guardrails inspect inputs and outputs and can prevent requests from reaching the model when policy violations are detected. Guardrails can help you:

Block prompts about topics your application doesn't support
Filter prompt injection attempts
Catch responses that drifted off-topic
Redact PII from model outputs
Reject harmful or policy-violating content in either direction

But guardrails are probabilistic, not deterministic. They'll catch most violations but not all of them. Something will slip through eventually, which is why they work best as one layer among several.
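To make the input/output idea concrete, here is a minimal sketch of where the two checks sit relative to the model call. This is not the Amazon Bedrock Guardrails API. The `input_ok` and `output_ok` callables are hypothetical stand-ins for whatever guardrail service you use, and `BLOCKED_MESSAGE` is just example wording.

```python
from typing import Callable

BLOCKED_MESSAGE = "Sorry, I can only help with questions about our menu and orders."


def guarded_call(prompt: str,
                 input_ok: Callable[[str], bool],
                 output_ok: Callable[[str], bool],
                 call_model: Callable[[str], str]) -> str:
    """Check the prompt before inference and the answer after it."""
    if not input_ok(prompt):
        # The model is never invoked, so a blocked prompt costs no inference.
        return BLOCKED_MESSAGE
    answer = call_model(prompt)
    if not output_ok(answer):
        # The inference was already paid for, but the off-policy answer is withheld.
        return BLOCKED_MESSAGE
    return answer
```

Only the input check saves you money. The output check protects the boundary, but by then the tokens have been generated.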

At a talk recently, someone told me their employer didn't want to pay for guardrails because they have their own per-request cost. I get the instinct, but it's the wrong comparison. You're weighing a predictable cost per call against an unpredictable and potentially unbounded one. A denial-of-wallet incident doesn't have a ceiling.

The whole reason these attacks work is cost asymmetry, and adding a cheap check in front of an expensive model is a good proactive measure. The expensive way to learn this is to skip guardrails, eat a surprise five- or six-figure inference bill, and turn them on anyway. Proactive is cheaper than reactive every single time.

Cost-Aware Routing: Match the Model to the Need

Even within scope, not every request requires the same resources.

Maybe "What are your store hours?" and "Can you help me compare three different catering options for a 200-person event?" are both on-topic. But they have different complexity, and routing them to the same frontier model at the same cost doesn't make sense.

You should be using some sort of tiered routing:

Simple or frequently-asked → cached response or a small model
Medium complexity → a standard model
High complexity → a frontier model

A good routing layer makes the product feel the same to the user while reducing the cost to operate it.

How you implement this depends on your architecture. There are a few patterns worth calling out:

Intent-based routing
Use a lightweight classifier to categorize the complexity of a request before choosing which model handles it. A question about store hours gets flagged as simple and routed to a cached response or a small model. A more complex or open-ended question gets flagged as complex and routed to something more capable.
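Here is a minimal sketch of that flow. The word-count heuristic in `classify` is a deliberately crude placeholder for a real lightweight classifier, and the model names, thresholds, and cached entry are all made up for illustration.

```python
from typing import Tuple

# Hypothetical cache of frequently asked questions, keyed by normalized prompt.
CACHED_ANSWERS = {
    "what are your store hours?": "<cached store-hours answer>",
}

# Hypothetical model names, one per tier.
MODEL_FOR_TIER = {
    "simple": "small-model",
    "medium": "standard-model",
    "complex": "frontier-model",
}


def classify(prompt: str) -> str:
    """Placeholder classifier: uses word count as a stand-in for complexity."""
    words = len(prompt.split())
    if words <= 8:
        return "simple"
    if words <= 25:
        return "medium"
    return "complex"


def route(prompt: str) -> Tuple[str, str]:
    """Return ("cache", answer) on a cache hit, else ("model", model_name)."""
    key = prompt.strip().lower()
    if key in CACHED_ANSWERS:
        return ("cache", CACHED_ANSWERS[key])
    return ("model", MODEL_FOR_TIER[classify(prompt)])
```

In a real system you would swap `classify` for a small model or a trained classifier and keep the rest of the shape: check the cache first, then pick the cheapest tier that can handle the request.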

Agent-based routing
Build a cheap orchestrator agent whose only job is to decide which downstream model or tool handles the request. The orchestrator itself runs on a small model, so cost is low per call. It looks at the input, picks a handler, and passes the request along. This is more flexible than a workflow-based solution but adds a hop and latency.

Managed model routers
Some platforms give you model routing out of the box. On AWS, Amazon Bedrock offers intelligent prompt routing that automatically selects the most cost-effective model for a given request, and several AI gateways and model-routing services offer comparable capabilities. If something like that's available to you, it can be the fastest path to tiered routing with minimal engineering effort.

None of these are mutually exclusive, and none of them are perfect. Intent classifiers can misroute, orchestrator agents add latency, and managed routers give you less control over the decision logic. You'll need to test what works for your traffic patterns and iterate.

The point here is that if every request hits the same expensive model regardless of complexity, you're overpaying for most of them, and you're making yourself a more attractive target for abuse in the process.

Budget Controls: Cost as a Dimension of Access

Even if every request is legitimate, on-topic, and correctly routed, you still need per-user cost controls.

Standard rate limiting doesn't apply as cleanly as you'd like. A standard rate limiter might limit a single user to 90 requests per minute, and for non-AI systems that might be enough to protect downstream components. But if each of those requests costs $0.30 in inference, that's $27 per minute, which is close to a $39,000/day burn rate. The volume is fine, but the cost per request is potentially catastrophic depending on your business.
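The arithmetic behind that burn rate is worth writing out, because it is the calculation you would redo with your own numbers. This small helper works in integer cents to avoid floating-point rounding; the function name is mine.

```python
def daily_burn_usd(requests_per_minute: int, cents_per_request: int) -> float:
    """Worst-case daily spend for one user who stays at the rate limit all day."""
    cents_per_minute = requests_per_minute * cents_per_request
    minutes_per_day = 60 * 24
    return cents_per_minute * minutes_per_day / 100


# 90 requests/minute at $0.30 each: $27 per minute, $38,880 per day.
print(daily_burn_usd(90, 30))
```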

What you need is cost-aware rate limiting. The pattern looks something like per-user or per-tenant inference budgets:

Normal usage     → full access
Approaching cap  → notify or warn
At cap           → downgrade to cheaper model
Over cap         → queue or deny

This doesn't need to be a binary switch, with inference either on or off. You can have levels of enforcement. You can downgrade model quality, increase latency, require additional verification, or some combination when a user begins to hit their limits. The experience degrades gracefully rather than cutting off abruptly.
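The ladder above maps fairly directly onto a small function. This is a sketch under assumptions: the 80% warning threshold and the 120% hard stop are numbers I picked for illustration, and the action names are hypothetical.

```python
def enforcement_action(spent_usd: float, cap_usd: float,
                       warn_ratio: float = 0.8,
                       hard_ratio: float = 1.2) -> str:
    """Map a user's spend against their cap to a graduated action."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    ratio = spent_usd / cap_usd
    if ratio < warn_ratio:
        return "full_access"   # normal usage
    if ratio < 1.0:
        return "warn"          # approaching cap
    if ratio < hard_ratio:
        return "downgrade"     # at cap: serve from a cheaper model
    return "deny"              # over cap: queue or reject


# Example: $9 spent against a $10 cap falls in the warning band.
print(enforcement_action(9.0, 10.0))
```

Where you put the thresholds is a product decision as much as an infrastructure one, which is why they are parameters here rather than constants.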

Budget controls matter even without adversarial users, because they also protect you from legitimate users who consume more than expected. Without them, you can't tell the two cases apart, and you can't respond proportionally to either.

Conceptually, implementation comes down to two things: tracking spend per identity, and making access decisions based on that spend.

For tracking, you need something that accumulates cost per user or tenant in near real-time. This could be as simple as a counter in a database that increments with the token cost of each request, or as sophisticated as an event stream that aggregates usage across multiple services. The key is that the system checking the budget needs to know what's already been spent before the next request runs.
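The simple end of that spectrum could look something like this. It is an in-memory sketch: the dictionary stands in for a database counter, the flat per-1,000-token price is an assumption (real pricing usually differs for input and output tokens), and a production version would need atomic increments shared across instances.

```python
from collections import defaultdict


class SpendTracker:
    """Accumulates inference cost per user. In-memory stand-in for a shared store."""

    def __init__(self, usd_per_1k_tokens: float) -> None:
        self.usd_per_1k_tokens = usd_per_1k_tokens
        self._spent = defaultdict(float)

    def spent_by(self, user_id: str) -> float:
        """Read current spend. Call this before the next request runs."""
        return self._spent.get(user_id, 0.0)

    def record(self, user_id: str, tokens: int) -> float:
        """Add the cost of a completed request and return the new total."""
        self._spent[user_id] += tokens / 1000 * self.usd_per_1k_tokens
        return self._spent[user_id]
```

The order of operations matters more than the storage choice: read the total with `spent_by` before inference, and call `record` as soon as the token count for the request is known.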

For access decisions, you need a policy layer that maps identity to entitlements. What tier is this user on? What's their daily budget? What models are they allowed to access? This is where feature flags and configuration-driven access control come in. You define the rules ("free tier gets 50,000 tokens/day, paid tier gets 500,000") and the system evaluates them on every request.
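A configuration-driven version of those rules might be sketched like this. The token budgets come from the example above; the model names and which tier may use which model are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Tier:
    daily_token_budget: int
    allowed_models: FrozenSet[str]


# Hypothetical entitlements. In practice this would live in config or a
# feature-flag service, not in code.
TIERS = {
    "free": Tier(50_000, frozenset({"small-model"})),
    "paid": Tier(500_000, frozenset({"small-model", "standard-model", "frontier-model"})),
}


def is_allowed(tier_name: str, model: str,
               tokens_used_today: int, tokens_requested: int) -> bool:
    """Evaluate the tier's rules for one request."""
    tier = TIERS[tier_name]
    if model not in tier.allowed_models:
        return False
    return tokens_used_today + tokens_requested <= tier.daily_token_budget
```

Because the rules are data, changing a budget or moving a model between tiers is a config change rather than a deploy.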

Observability: Know When You're Being Robbed

Every layer so far is preventive: each one stops a bad request before inference runs, but none of them tells you when an attack is actually happening. The worst way to find out you've been robbed is from the invoice at the end of the month. You need both prevention and detection.

This layer sits underneath all the others. Budget controls already had you tracking spend per identity in near real-time. Observability is that same data but with alerts around it. You should monitor cost-per-user and tokens-per-user over time, and treat a sudden spike the same way you'd treat a spike in 500s or latency. It should page someone.

Some signals to look out for:

A single user or tenant suddenly consuming far more inference than normal
A flood of requests being denied by your guardrails
A spike in requests routed to your most expensive models
Traffic spreading across an unusual range of accounts, IPs, or geographies
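The first of those signals can be approximated with a very small check: compare a user's spend in the current window against their own recent baseline. This is a sketch, not a tuned detector. The 5x multiplier and the $1 floor on the baseline are arbitrary values I chose for the example.

```python
from statistics import mean
from typing import Sequence


def is_spend_spike(previous_windows_usd: Sequence[float],
                   current_window_usd: float,
                   multiplier: float = 5.0,
                   min_baseline_usd: float = 1.0) -> bool:
    """Flag the current window if it exceeds a multiple of the recent average."""
    if not previous_windows_usd:
        return False  # no history yet, so there is nothing to compare against
    # The floor keeps tiny baselines from triggering alerts on trivial spend.
    baseline = max(mean(previous_windows_usd), min_baseline_usd)
    return current_window_usd > multiplier * baseline


# Example: a user who normally spends $2-3 per hour suddenly spends $40.
print(is_spend_spike([2.0, 3.0, 2.5], 40.0))
```

The same comparison works for guardrail denials or frontier-model request counts: keep a rolling baseline per signal and alert on a multiple of it.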

The whole advantage of a denial-of-wallet attack, from the perspective of the attacker, is that it looks like legitimate traffic. Everything appears healthy while costs accumulate in the background. Observability is what collapses the time between being robbed and knowing it.

The Full Picture

When you put these layers together:

Bot verification catches automated abuse and scripted traffic
Guardrails enforce content boundaries, filtering harmful or off-topic inputs and outputs
Cost-aware routing prevents over-spending on requests that don't need expensive models
Budget controls prevent any single user or tenant from draining the system
Observability surfaces attacks in progress so you find out from an alert instead of an invoice

Each layer is cheaper to operate than inference on a frontier model, which only runs after the cheaper checks are passed.

No single layer is sufficient on its own. Bot detection misses human abuse, and guardrails miss high-volume but on-topic attacks. Routing doesn't help if one user sends a million complex requests. Per-user budgets don't help if the requests are coming from thousands of fake accounts. But together, they cover each other's gaps. An attacker who gets past one layer still has to get past the others, and the economics of the attack get worse at every step.

You don't necessarily need all of these on day one, and this isn't an exhaustive list of what could be done. Start with the ones that help mitigate your biggest exposure. If your endpoint is public and anonymous, bot verification is probably first. If your users are authenticated but your domain is specific, guardrails might matter more. Look at where you're most exposed and address that first.

Layered Architectures for AI Security

We've been doing layered defense for decades. There's a CDN, a WAF, a load balancer, authentication, and authorization between the internet and your database. Nobody would route raw internet traffic directly to a production database and rely on the database to decide whether the request is legitimate.

But that's what a lot of AI endpoints look like right now. User, model, response. The backend AI agent is simultaneously the product and the security boundary.

I wrote about this pattern in my article on AI agent architectures: the tendency to put the model at the center of everything and ask it to do jobs that should be handled by cheaper, more deterministic systems. Inference theft is the same problem from a security angle. The backend agent or model shouldn't be deciding who's allowed to use it. That decision should happen long before it's involved.

The architecture doesn't need to be complicated, but it does need to exist. If your AI agent endpoint is exposed to the internet with nothing between the user and a frontier model except a login check, you've built a target with a clear dollar value attached to every request.

But the starting point is simple: stop asking the most expensive and least deterministic component in your system to also be your security boundary. Put cheaper, more effective things in front of it. Make the model the last step, not the first.

Top comments (5)

Alex Shev • Jun 5

Strong framing around "budget as an attack surface." The part I'd add is an intent/cost gate before model routing: first decide whether the request belongs to the product domain, then decide whether it deserves the expensive path.

A lot of teams add rate limits, but forget that cost limits are a product rule, not just an infra rule.

Morgan Willis • Jun 5

Interesting point!

Mudassir Khan • Jun 12

the 'expensive model last in the chain' framing is the one that clicks. we hit this exact footgun building an MCP server — every tool call routed straight to the main model, no topic filter, no budget gate. one user with a high loop count ran up $80 in a single session on what was supposed to be an internal tool.

fix was a cheap classifier up front: $0.0001 per call, reject out of scope before anything expensive runs. turned that $80 edge case into a $0.04 one.

does your gateway handle prompt level intent classification, or is that still a build on top problem?

Rafael Lopes • Jun 6 • Edited

I really enjoyed this article and learned something new. I was thinking about it and i believe the good and old geo-restriction is also useful to avoid exposing your endpoints on where they don’t need to be exposed. Doesn’t work if you’re big and famous (and hence targeted) though, but for emergent workloads it can save some headache!

Rahul S • Jun 11

Geo-restriction's underrated, you're right — it eliminates a huge chunk of noise for free. The gap is residential proxy networks though, which route through IPs within your allowed geography so they pass geo checks cleanly. For emergent workloads especially, checking the infrastructure type of the requesting IP (datacenter/hosting provider vs actual residential ISP) catches what geo-restriction lets through, and it's still way cheaper than a single inference call. You can test how different IPs classify at ipasis.com/scan to see the delta between geo alone vs geo + infrastructure type.