Earlier this year, a bunch of people figured out they could use a customer service chatbot for a popular fast-food chain as a free coding assistant. It went viral. Some customers came looking for burritos and others left with LeetCode solutions. Everyone got what they wanted except the company paying for the inference.
The chatbot was backed by a capable general-purpose model with no way to enforce what it should and shouldn't answer. If you asked it to invent a novel approach to bubble sort, it would try. The model didn't know it was only supposed to be a burrito bot. It just saw a prompt and responded.
If your AI endpoint doesn't restrict who can send it requests, and doesn't have a way to limit what it will and won't answer, any general-purpose model you expose becomes a general-purpose model for everyone, on your dime.
This is how you become the victim of inference theft.
Inference theft occurs when someone repurposes your AI application as a model endpoint that you never intended to expose. They route requests through your application and let you pay the inference bill. Inference theft is one of the fastest ways to create a denial-of-wallet event.
In a denial-of-wallet attack, the goal isn't to take your application offline. The goal is to make you absorb enough traffic that cost becomes the attack. With AI systems, inference theft is a two-for-one for the attacker: they run up your inference bill and monetize it at the same time.
I was reading a recent article from Vercel on inference theft and their framing lays out the economics: AI requests against frontier models can be a million times more expensive than standard HTTP requests. A traditional API call costs a fraction of a cent, but a single expensive agentic request can cost a dollar or more. That cost asymmetry makes AI endpoints one of the highest-margin targets an attacker can find.
Their recommendation is per-request bot verification, which is a good start. But I think it's one layer in what needs to be a multi-layer approach.
How Does Inference Theft Work?
One form of inference theft looks like this:
- An attacker finds an AI endpoint exposed to the internet (your chatbot, your docs assistant, your customer service agent)
- They write a thin adapter that wraps your endpoint in a standard model API interface
- They sell access to that adapter at 5-10% of what the model provider charges
- Their customers send prompts through the adapter, which routes them to your endpoint
- You pay the full inference cost for every request
This is exactly what happened to the burrito bot. Someone reverse-engineered its backend, wrapped it in a standard model API, and suddenly anyone could point their coding tools at a fast-food company's inference bill.
The attacker's cost to pull this off is a one-time engineering effort to build the adapter. Your cost is ongoing inference at full price. If you've built web applications before, you probably have a mental playbook for abuse prevention. The problem is that playbook was written when requests cost fractions of a cent, not dollars.
You might think standard rate limiting should be able to shed some of the load and reduce the risk. But rate limiting keyed on IP address fails here, because attackers can use residential proxies to spread the load across IPs that all look legitimate.
Then, what about authentication? Make sure the endpoint can only be called with valid tokens. But even if you have authentication in place, attackers can create or use thousands of accounts. And even a single legitimate account can run up serious costs if there's nothing limiting how many expensive tasks it can submit. Authentication tells you who is making the request, but it doesn't tell you how much that person should be allowed to spend.
You might also consider using CAPTCHAs, but CAPTCHAs only help if the problem is bots. Even then, an attacker only needs to solve the CAPTCHA once to unlock thousands of requests behind it. And CAPTCHAs do nothing at all when the abuse comes from real humans doing things your endpoint wasn't designed for, like asking your burrito bot to write Python.
So what do you actually do?
Bot Detection Is Necessary But Not Sufficient
Per-request bot verification is a real defense. For anonymous web-facing endpoints, behavioral analysis can identify large amounts of automated traffic effectively. But it only answers one question: is this request coming from a human?
It doesn't answer: is this request appropriate for an expensive model? Is this user within their budget? Should this request reach a model at all?
The fast-food chatbot wasn't being abused by bots. Real humans were sending it coding prompts. No amount of bot detection would have caught that because the requests were genuinely human.
Inference theft has multiple shapes, and you need to think about defense at multiple points in your architecture.
Multiple Layers, Not One Gate
In practice, I think about this as a series of questions that get asked before the expensive thing happens:
Is this a human? → Bot verification
Is this allowed content? → Guardrails enforcement
Does this need a big model? → Cost-aware routing
Is this user within budget? → Budget controls
↓
Model
Each layer answers a different question, and protects your system from a different angle.
Not every system needs all of these. If you're running an internal tool behind a VPN with trusted users, bot verification might not matter to you. The right combination depends on your threat model, your cost tolerance, and how your endpoint is exposed to the world.
That said, the principle holds: the expensive model should be the last thing in the chain, not the first thing that evaluates whether a request is legitimate.
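To make the ordering concrete, here is a minimal sketch of that chain in Python. Everything in it is an assumption for illustration: the `Request` fields, the 0.8 bot-score threshold, the keyword list standing in for a real guardrail, and the hard-coded spend table. Routing is left out to keep it short. The only point is the shape: cheap checks run in order, and the model is called only if all of them pass.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    user_id: str
    prompt: str
    bot_score: float  # hypothetical: 0.0 = human, 1.0 = bot, set by an upstream verifier


# A check returns None to pass, or a short reason string to reject.
Check = Callable[[Request], Optional[str]]


def not_a_bot(req: Request) -> Optional[str]:
    return "bot_suspected" if req.bot_score > 0.8 else None


def in_scope(req: Request) -> Optional[str]:
    # Keyword match is a stand-in for a real guardrail service.
    allowed = ("menu", "order", "hours", "catering")
    return None if any(word in req.prompt.lower() for word in allowed) else "off_topic"


def within_budget(req: Request) -> Optional[str]:
    # Hypothetical spend so far today, in dollars; a real system would look this up.
    spent_today = {"alice": 0.40, "mallory": 12.00}
    return "over_budget" if spent_today.get(req.user_id, 0.0) >= 5.00 else None


# Order matters: each check is cheaper than the model that sits behind it.
CHECKS: list[Check] = [not_a_bot, in_scope, within_budget]


def handle(req: Request, call_model: Callable[[str], str]) -> str:
    for check in CHECKS:
        reason = check(req)
        if reason is not None:
            return f"rejected: {reason}"  # the model is never invoked
    return call_model(req.prompt)
```

Because `call_model` is passed in, you can test the whole chain with a stub and confirm that rejected requests never reach inference.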
Use a Proper Front Door
Traditional applications don't expose databases directly to the internet. Requests flow through API gateways, load balancers, authentication layers, and policy engines before they ever reach a backend service.
AI systems need the same pattern.
An API gateway or AI gateway acts as the front door for your models and agents. Instead of allowing users to interact directly with a model endpoint, requests pass through a centralized layer that can enforce authentication, route traffic, track usage, and make cost-based decisions before inference occurs.
This is the same architectural evolution we went through with APIs. We learned that pushing security, governance, and traffic management into every individual service creates inconsistency and operational overhead. API gateways centralized those concerns. AI gateways are emerging to do the same thing for model traffic.
Whether you build this layer yourself or adopt a managed solution, the principle is the same: users shouldn't be talking directly to expensive models. They should be talking to a system that decides whether a model invocation should happen at all.
Prompts Are Suggestions, Guardrails Are Architecture
Even when requests make it past the first few lines of defense, you should still have other layers to protect against abuse. This is because models don't really have boundaries.
Without guardrails in place, they'll answer whatever you ask them. That's what makes them useful, and it's also what makes them expensive to leave unsupervised or unsecured.
If your customer service bot will happily reverse a linked list, it doesn't have boundaries. And the instinct is to prompt engineer harder: "Only answer questions about our menu." But prompts are suggestions, not real enforcement, and a determined user can get around prompt-level instructions.
The bare minimum is to enforce boundaries at the model layer itself with input and output guardrails for content filtering. On AWS, Amazon Bedrock Guardrails lets you define these policies as configuration rather than code, and other providers offer something similar.
Guardrails inspect inputs and outputs and can prevent requests from reaching the model when policy violations are detected. Guardrails can help you:
- Block prompts about topics your application doesn't support
- Filter prompt injection attempts
- Catch responses that drifted off-topic
- Redact PII from model outputs
- Reject harmful or policy-violating content in either direction
But guardrails are probabilistic, not deterministic. They'll catch most violations but not all of them. Something will slip through eventually, which is why they work best as one layer among several rather than your only line of defense.
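Here is a rough sketch of what an input and output check could look like with the `apply_guardrail` operation on the boto3 `bedrock-runtime` client. The guardrail ID, version, refusal message, and `call_model` function are placeholders I made up, and the request and response shapes reflect my understanding of the API, so check them against the current documentation before relying on this.

```python
from typing import Any, Callable

# Hypothetical refusal text; use whatever fits your product.
BLOCKED_MESSAGE = "Sorry, I can only help with questions about our menu and orders."


def passes_guardrail(client: Any, guardrail_id: str, version: str,
                     text: str, source: str) -> bool:
    """Return True when the guardrail did not intervene on `text`.

    `source` is "INPUT" for user prompts and "OUTPUT" for model responses.
    """
    response = client.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=version,
        source=source,
        content=[{"text": {"text": text}}],
    )
    return response["action"] != "GUARDRAIL_INTERVENED"


def guarded_answer(client: Any, guardrail_id: str, version: str,
                   prompt: str, call_model: Callable[[str], str]) -> str:
    # Check the prompt first so a blocked request never costs an inference call.
    if not passes_guardrail(client, guardrail_id, version, prompt, "INPUT"):
        return BLOCKED_MESSAGE
    answer = call_model(prompt)
    # Check the response too, in case the model drifted off-topic.
    if not passes_guardrail(client, guardrail_id, version, answer, "OUTPUT"):
        return BLOCKED_MESSAGE
    return answer
```

In real use, `client` would be `boto3.client("bedrock-runtime")`. Passing it in as a parameter keeps the function easy to test with a fake client.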
At a talk recently, someone told me their employer didn't want to pay for guardrails because they have their own per-request cost. I get the instinct, but it's the wrong comparison. You're weighing a predictable cost per call against an unpredictable and potentially unbounded one. A denial-of-wallet incident doesn't have a ceiling, but your guardrail bill does.
The whole reason these attacks work is cost asymmetry, and adding a cheap check in front of an expensive model puts that asymmetry back on your side. If you've decided you need boundary enforcement, cost is a weak reason to skip the layer that provides it. The expensive way to learn this is to skip guardrails, eat a surprise five- or six-figure inference bill, and turn them on anyway. Proactive is cheaper than reactive every single time.
How you define what's in scope is specific to your domain, and there's no universal answer. But if you haven't defined it and found ways to enforce it, you're relying on the model itself to police its own usage. That's the most expensive bouncer you could possibly hire.
Cost-Aware Routing: Match the Model to the Need
Even within scope, not every request requires the same resources.
Maybe "What are your store hours?" and "Can you help me compare three different catering options for a 200-person event?" are both on-topic. But they have different complexity, and routing them to the same frontier model at the same cost doesn't make sense.
You should be using some sort of tiered routing:
- Simple or frequently-asked → cached response or a small model
- Medium complexity → a standard model
- High complexity → a frontier model
A good routing layer makes the product feel the same to the user while reducing the cost to operate it.
How you implement this depends on your architecture. There are a few patterns worth calling out:
Intent-based routing
Use a lightweight classifier to categorize the complexity of a request before choosing which model handles it. A question about store hours gets flagged as simple and routed to a cached response or a small model. A more complex or open-ended question gets flagged as complex and routed to something more capable.
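As a sketch of the control flow, the example below checks a cache first, then classifies the request and picks a model tier. The model names, the cached answer, and especially the word-count classifier are stand-ins I invented for illustration. A real classifier would be a small model or a trained intent classifier, not a word counter.

```python
from typing import Callable

# Hypothetical model identifiers, one per tier.
MODEL_BY_TIER = {
    "simple": "small-model",
    "medium": "standard-model",
    "complex": "frontier-model",
}

# Hypothetical cache of frequently asked questions.
FAQ_CACHE = {"what are your store hours?": "We're open 10am to 10pm every day."}


def classify(prompt: str) -> str:
    """Toy stand-in for a lightweight intent classifier: buckets by word count."""
    words = len(prompt.split())
    if words <= 8:
        return "simple"
    if words <= 25:
        return "medium"
    return "complex"


def route(prompt: str, call_model: Callable[[str, str], str]) -> str:
    key = prompt.strip().lower()
    if key in FAQ_CACHE:
        return FAQ_CACHE[key]  # no inference at all
    tier = classify(prompt)
    return call_model(MODEL_BY_TIER[tier], prompt)
```

The caller only ever sees `route`, so you can swap the classifier or the tier table without touching the rest of the application.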
Agent-based routing
Build a cheap orchestrator agent whose only job is to decide which downstream model or tool handles the request. The orchestrator itself runs on a small model, so cost is low per call. It looks at the input, picks a handler, and passes the request along. This is more flexible than a workflow-based solution but adds a hop and latency.
Managed model routers
Some platforms give you model routing out of the box. On AWS, Amazon Bedrock offers intelligent prompt routing that automatically selects the most cost-effective model for a given request, and several AI gateways and model-routing services offer comparable capabilities. If something like that's available to you, it can be the fastest path to tiered routing with minimal engineering effort.
None of these are mutually exclusive, and none of them are perfect. Intent classifiers can misroute, orchestrator agents add latency, and managed routers give you less control over the decision logic. You'll need to test what works for your traffic patterns and iterate.
The point here is that if every request hits the same expensive model regardless of complexity, you're overpaying for most of them, and you're making yourself a more attractive target for abuse in the process.
Budget Controls: Cost as a Dimension of Access
Even if every request is legitimate, on-topic, and correctly routed, you still need per-user cost controls.
Standard rate limiting doesn't apply as cleanly as you'd like. A standard rate limiter might limit a single user to 90 requests per minute, and for non-AI systems that might be enough to protect downstream components. But if each of those requests costs $0.30 in inference, that's $27 per minute, which is close to a $39,000/day burn rate. The volume is fine, but the cost per request is potentially catastrophic depending on your business.
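The arithmetic behind that burn rate is simple enough to check in a few lines. The function name is mine, and it assumes the user sustains the full rate limit around the clock:

```python
MINUTES_PER_DAY = 24 * 60


def daily_burn(requests_per_minute: int, cost_per_request: float) -> float:
    """Worst-case daily cost if one user sustains the rate limit all day."""
    return requests_per_minute * cost_per_request * MINUTES_PER_DAY


per_minute = 90 * 0.30
per_day = daily_burn(90, 0.30)
print(f"${per_minute:.2f}/minute, ${per_day:,.2f}/day")
# prints "$27.00/minute, $38,880.00/day"
```

And that is a single user staying inside the limit the whole time.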
What you need is cost-aware rate limiting. The pattern looks something like per-user or per-tenant inference budgets:
Normal usage → full access
Approaching cap → notify or warn
At cap → downgrade to cheaper model
Over cap → queue or deny
This doesn't need to be a binary switch, with inference either on or off. You can have levels of enforcement: downgrade model quality, increase latency, require additional verification, or some combination as a user begins to hit their limits. The experience degrades gracefully rather than cutting off abruptly.
Budget controls matter even without adversarial users, because they also protect you from legitimate users who consume more than expected. Without them, you can't tell the difference between the two cases, and you can't respond proportionally to either.
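One way to express those levels is a function that maps spend against budget to an action. The thresholds here are assumptions I picked for the example: warn at 80% of budget, and allow a downgraded experience up to 120% before denying.

```python
from enum import Enum


class Action(Enum):
    FULL_ACCESS = "full_access"
    WARN = "warn"
    DOWNGRADE = "downgrade"
    DENY = "deny"


def enforcement_action(spent: float, budget: float,
                       warn_at: float = 0.8, hard_limit: float = 1.2) -> Action:
    """Map a user's spend to an enforcement level.

    `warn_at` and `hard_limit` are fractions of the budget (assumed values).
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spent / budget
    if ratio < warn_at:
        return Action.FULL_ACCESS
    if ratio < 1.0:
        return Action.WARN        # approaching cap
    if ratio <= hard_limit:
        return Action.DOWNGRADE   # at cap: serve from a cheaper model
    return Action.DENY            # over cap: queue or deny
```

Returning an action instead of a boolean is what lets the caller degrade the experience in steps.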
Conceptually, implementation comes down to two things: tracking spend per identity, and making access decisions based on that spend.
For tracking, you need something that accumulates cost per user or tenant in near real-time. This could be as simple as a counter in a database that increments with the token cost of each request, or as sophisticated as an event stream that aggregates usage across multiple services. The key is that the system checking the budget needs to know what's already been spent before the next request runs.
For access decisions, you need a policy layer that maps identity to entitlements. What tier is this user on? What's their daily budget? What models are they allowed to access? This is where feature flags and configuration-driven access control come in. You define the rules ("free tier gets 50,000 tokens/day, paid tier gets 500,000") and the system evaluates them on every request. When the budget runs out, the policy kicks in: downgrade, queue, or deny.
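Here is a minimal in-memory sketch of both halves, using the tier numbers from the example rules above. The class and method names are hypothetical. A real deployment would keep the counters in a shared store with atomic increments, since an in-process dictionary won't survive restarts or work across multiple instances.

```python
from datetime import date
from typing import Optional

# Policy: tokens per day for each tier, from the example rules above.
DAILY_TOKEN_BUDGET = {"free": 50_000, "paid": 500_000}


class TokenBudget:
    def __init__(self, user_tiers: dict[str, str]):
        self._user_tiers = user_tiers                 # user_id -> tier name
        self._used: dict[tuple[str, date], int] = {}  # (user_id, day) -> tokens

    def remaining(self, user_id: str, today: Optional[date] = None) -> int:
        today = today or date.today()
        tier = self._user_tiers.get(user_id, "free")  # unknown users get the free tier
        used = self._used.get((user_id, today), 0)
        return max(0, DAILY_TOKEN_BUDGET[tier] - used)

    def allow(self, user_id: str, estimated_tokens: int,
              today: Optional[date] = None) -> bool:
        """Check the budget before the request runs."""
        return estimated_tokens <= self.remaining(user_id, today)

    def record(self, user_id: str, tokens_used: int,
               today: Optional[date] = None) -> None:
        """Record actual usage after the request completes."""
        today = today or date.today()
        key = (user_id, today)
        self._used[key] = self._used.get(key, 0) + tokens_used
```

Keying the counter by day means budgets reset on their own without a cleanup job, though old entries would still need to be expired in a long-running process.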
Observability: Know When You're Being Robbed
Every layer so far is preventive: each one stops a bad request before inference runs. But none of them tell you when an attack is actually happening, and the worst way to find out you've been robbed is from the invoice at the end of the month. You need both prevention and detection.
This layer sits underneath all the others. Budget controls already had you tracking spend per identity in near real-time. Observability is that same data but with alerts around it. You should monitor cost-per-user and tokens-per-user over time, and treat a sudden spike the same way you'd treat a spike in 500s or latency. It should page someone.
Some signals to look out for:
- A single user or tenant suddenly consuming far more inference than normal
- A flood of requests being denied by your guardrails
- A spike in requests routed to your most expensive models
- Traffic spreading across an unusual range of accounts, IPs, or geographies
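The first signal in that list can be approximated with a simple baseline comparison. This sketch flags a user whose cost in the current interval sits far above their own recent history. The function name, the three-deviation threshold, and the floor on the spread are all assumptions, and a production system would more likely use the anomaly detection built into its monitoring stack.

```python
from statistics import mean, pstdev


def is_cost_spike(history: list[float], current: float,
                  min_points: int = 5, threshold: float = 3.0) -> bool:
    """Return True when `current` is far above the user's recent baseline.

    `history` holds the user's cost per interval (for example, per hour).
    """
    if len(history) < min_points:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = pstdev(history)
    # Floor the spread so a perfectly flat history doesn't flag tiny changes
    # or cause a division by zero.
    floor = max(spread, 0.1 * baseline, 0.01)
    return (current - baseline) / floor > threshold
```

A `True` result is what you would wire to an alert, the same way you would for a spike in 500s.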
The same instrumentation also helps you tune every other layer. It tells you where your classifier is misrouting, where guardrails are maybe too strict, where budgets are too loose or too restrictive, and where routing decisions aren't matching reality.
The whole advantage of a denial-of-wallet attack, from the perspective of the attacker, is that it looks like legitimate traffic. Everything appears healthy while costs accumulate in the background. Observability is what collapses the time between being robbed and knowing it.
The Full Picture
When you put these layers together:
- Bot verification catches automated abuse and scripted traffic
- Guardrails enforce content boundaries, filtering harmful or off-topic inputs and outputs
- Cost-aware routing prevents over-spending on requests that don't need expensive models
- Budget controls prevent any single user or tenant from draining the system
- Observability surfaces attacks in progress so you find out from an alert instead of an invoice
Each layer is cheaper to operate than inference on a frontier model, which only runs after the cheaper checks are passed.
No single layer is sufficient on its own. Bot detection misses human abuse, and guardrails miss high-volume but on-topic attacks. Routing doesn't help if one user sends a million complex requests, and per-user budgets don't help if the requests come from thousands of fake accounts. But together, they cover each other's gaps. An attacker who gets past one layer still has to get past the others, and the economics of the attack get worse at every step.
You don't necessarily need all of these on day one, and this isn't an exhaustive list of what could be done. Start with the ones that help mitigate your biggest exposure. If your endpoint is public and anonymous, bot verification is probably first. If your users are authenticated but your domain is specific, guardrails might matter more. Look at where you're most exposed and address that first.
Layered Architectures for AI Security
We've been doing layered defense for decades. There's a CDN, a WAF, a load balancer, authentication, and authorization between the internet and your database. Nobody would route raw internet traffic directly to a production database and rely on the database to decide whether the request is legitimate.
But that's what a lot of AI endpoints look like right now. User, model, response. The backend AI agent is simultaneously the product and the security boundary.
I wrote about this pattern in my article on AI agent architectures: the tendency to put the model at the center of everything and ask it to do jobs that should be handled by cheaper, more deterministic systems. Inference theft is the same problem from a security angle. The backend agent or model shouldn't be deciding who's allowed to use it. That decision should happen long before it's involved.
The architecture doesn't need to be complicated, but it does need to exist. If your AI agent endpoint is exposed to the internet with nothing between the user and a frontier model except a login check, you've built a target with a clear dollar value attached to every request.
But the starting point is simple: stop asking the most expensive and least deterministic component in your system to also be your security boundary. Put cheaper, more effective things in front of it. Make the model the last step, not the first.