I built a site that gives deliberately wrong answers using an LLM.
No login. No user API key. Anyone can hit the endpoint.
amtaitfy.com is that site: a toy that serves intentionally wrong, AI-generated answers. That narrows the engineering problem:
- Make abuse bounded
- Make costs predictable
- Make casual attacks boring
The core architectural decision is simple:
GET serves cache only. POST is the only path that triggers fresh AI inference.
Everything else is defense in depth.
Threat model
In scope:
- Accidental viral traffic
- Casual prompt-extraction probes
- Repeat-query cost amplification
- Basic bot and spam traffic
- Provider outages
- Budget exhaustion
Out of scope:
- Sophisticated botnets
- Attackers with unlimited valid Turnstile tokens
- Full prompt-injection resistance
- Cache poisoning by determined users
- Sensitive workloads
- Anything that should require authentication
The request flow
GET /answer
read cache
return cached answer or empty state
POST /answer
verify Turnstile token
reject missing session
reject oversized input
check session lockout
check existing cache
call ai provider
write cache
return answer
GET is cheap. POST is expensive. On purpose.
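Here is a minimal sketch of that boundary as a Cloudflare Worker. The binding name (ANSWERS), the query parameter, and the helpers (cacheKey, verifyTurnstile, callProvider, json) are stand-ins, not the site's real code, and the session and lockout checks from the flow above are omitted; it only shows the GET/POST split.

```ts
// Sketch of the hard boundary: GET reads cache, POST is the only inference path.
// Assumes @cloudflare/workers-types for KVNamespace.
interface Env {
  ANSWERS: KVNamespace;
}

export default {
  async fetch(req: Request, env: Env): Promise<Response> {
    const url = new URL(req.url);
    if (url.pathname !== "/answer") return new Response("Not found", { status: 404 });

    if (req.method === "GET") {
      // Cache read only: a shared or crawled URL can never trigger inference.
      const key = await cacheKey(url.searchParams.get("q") ?? "");
      const cached = await env.ANSWERS.get(key);
      return json({ answer: cached, cached: cached !== null });
    }

    if (req.method === "POST") {
      const { q, token } = (await req.json()) as { q: string; token: string };
      if (!(await verifyTurnstile(token))) return json({ error: "turnstile" }, 403);
      if (q.length > 500) return json({ error: "too long" }, 413);

      const key = await cacheKey(q);
      const existing = await env.ANSWERS.get(key);
      if (existing) return json({ answer: existing, cached: true });

      const answer = await callProvider(q); // the only call that costs money
      await env.ANSWERS.put(key, answer);   // cache forever: no TTL
      return json({ answer, cached: false });
    }

    return new Response("Method not allowed", { status: 405 });
  },
};

// Stand-ins for the real helpers described elsewhere in the post.
declare function cacheKey(q: string): Promise<string>;
declare function verifyTurnstile(token: string): Promise<boolean>;
declare function callProvider(q: string): Promise<string>;
declare function json(body: unknown, status?: number): Response;
```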
If a URL gets shared, crawled, screenshotted, bookmarked, or posted somewhere large, none of that triggers inference. It can go viral and cost me nothing, because only a deliberate POST reaches the model. The first visitor to ask a question may trigger one inference through POST; every later visitor to that URL gets the cached answer from Cloudflare KV. Virality does not balloon cost.
Casual probe friction
I added a small ruleset for obvious prompt-extraction probes:
- “ignore previous instructions”
- “print your system prompt”
- “reveal your hidden prompt”
This is not real prompt-injection defense. It catches low-effort probes and gives me a tripwire.
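A sketch of what that tripwire can look like. The patterns here are illustrative, not the site's exact list:

```ts
// Low-effort prompt-extraction tripwire. This is not injection defense;
// it just flags the obvious phrasings so the session can be locked out.
const PROBE_PATTERNS: RegExp[] = [
  /ignore\s+(all\s+)?previous\s+instructions/i,
  /print\s+your\s+system\s+prompt/i,
  /reveal\s+your\s+(hidden\s+)?prompt/i,
];

function looksLikeExtractionProbe(prompt: string): boolean {
  return PROBE_PATTERNS.some((pattern) => pattern.test(prompt));
}
```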
The first version was stupid. When it detected an extraction attempt, it responded with a hostile message and included my actual system prompt, followed by “There will be cake.”
The GLaDOS reference felt clever for about five minutes.
The current response gives no detail about what matched. No prompt content. No explanation of what was caught. Just a generic refusal. The goal is to provide no signal.
Session lockout
When the extraction tripwire fires, the session gets a short lockout.
I store a 60-second KV entry keyed by session. Further POST attempts during that window return a 403 with a countdown.
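A sketch of that lockout on KV, assuming a LOCKOUTS binding and a sessionId pulled from the session cookie; KV's expirationTtl handles the expiry (60 seconds is also KV's minimum TTL):

```ts
// Session lockout: a KV entry that expires on its own after 60 seconds.
const LOCKOUT_SECONDS = 60;

async function lockSession(env: { LOCKOUTS: KVNamespace }, sessionId: string): Promise<void> {
  const until = Date.now() + LOCKOUT_SECONDS * 1000;
  await env.LOCKOUTS.put(`lock:${sessionId}`, String(until), {
    expirationTtl: LOCKOUT_SECONDS, // KV deletes the key for us
  });
}

async function lockoutRemaining(env: { LOCKOUTS: KVNamespace }, sessionId: string): Promise<number> {
  const until = await env.LOCKOUTS.get(`lock:${sessionId}`);
  if (!until) return 0;
  return Math.max(0, Math.ceil((Number(until) - Date.now()) / 1000));
}

// In the POST handler:
//   const wait = await lockoutRemaining(env, sessionId);
//   if (wait > 0) return json({ error: "locked", retryAfterSeconds: wait }, 403);
```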
The IP lockout I removed
I originally added a second lockout keyed by a hash of the user's IP, to close the obvious incognito bypass:
- normal session gets locked
- user opens incognito
- new session cookie
- same IP
- lockout still applies
But I removed it.
CGNAT makes IP-based lockouts dangerous. Mobile carriers, corporate networks, apartment complexes, and some home ISPs can place many users behind one external IP. Locking out an IP to stop one bad session has an unacceptably large blast radius. For this site, session-only lockout is the better tradeoff: it leaves a known bypass, but avoids locking out innocent users.
Timing leaks
The regex tripwire returns almost instantly. A real model response takes two to five seconds. That difference is a timing side channel: an attacker iterating on phrasing can tell a filtered response from a generated one by latency alone.
So lockout and tripwire responses now wait until total request time lands in a random window that roughly matches model latency. Randomizing the delay removes that signal.
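A sketch of how that padding can work: capture a start time at the top of the handler, then delay the refusal until total time lands somewhere in a window that looks like a model call. The 2-5 second window here is illustrative:

```ts
// Pad tripwire/lockout responses so their latency falls in roughly the same
// window as a real model call, hiding the fast-path timing signal.
async function respondWithPaddedLatency(
  startedAtMs: number,            // Date.now() captured at the top of the handler
  makeResponse: () => Response,
): Promise<Response> {
  const targetMs = 2000 + Math.random() * 3000; // uniform in [2s, 5s]
  const elapsed = Date.now() - startedAtMs;
  const waitMs = Math.max(0, targetMs - elapsed);
  await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
  return makeResponse();
}
```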
Cache forever: How GET stays cheap
The cache is the main cost-control mechanism. Repeat prompts should not create repeat inference costs.
But “cache forever” has sharp edges.
The first caller effectively defines the canonical answer. The first caller can also define a bad canonical answer. I treat the first answer as canonical on purpose. URLs stay shareable, repeat traffic stays free, and the occasional dud is the price.
The cache is not namespaced by prompt version. There is no elegant invalidation layer. If the system prompt changes or a bad answer becomes canonical, the fix is manual cleanup or a broader cache reset.
The future upgrade would be to add a version prefix to cache keys so prompt changes, model changes, or answer-format changes can move to a new cache namespace without serving old entries.
Something like:
cache:v3:<hash(normalized_prompt)>
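A sketch of what that versioned key could look like, using the Web Crypto API that Workers expose; the version string and key layout are assumptions:

```ts
// Versioned cache key: bump CACHE_VERSION when the system prompt, model,
// or answer format changes, and old entries simply stop being read.
const CACHE_VERSION = "v3"; // hypothetical current version

async function cacheKey(normalizedPrompt: string): Promise<string> {
  const bytes = new TextEncoder().encode(normalizedPrompt);
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  const hex = [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
  return `cache:${CACHE_VERSION}:${hex}`;
}
```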
KV counters vs Durable Objects
I use KV counters for operational telemetry:
- Daily estimated spend
- Provider health
- Probe counts
- Rough request volume
KV is eventually consistent. Under burst traffic, two near-simultaneous writes can miss each other and produce an undercount.
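For illustration, this is roughly what a KV counter bump looks like and why it can undercount: a plain read-modify-write with no coordination, so two concurrent increments can read the same value and one update is lost. The COUNTERS binding and key layout are made up:

```ts
// Coarse telemetry counter on KV. Fine for rough signals,
// not for a hard budget guardrail.
async function bumpCounter(env: { COUNTERS: KVNamespace }, name: string): Promise<void> {
  const day = new Date().toISOString().slice(0, 10);      // per-day key, e.g. 2024-06-01
  const key = `counter:${day}:${name}`;
  const current = Number((await env.COUNTERS.get(key)) ?? "0");
  await env.COUNTERS.put(key, String(current + 1), {
    expirationTtl: 60 * 60 * 24 * 14, // keep two weeks of daily counters
  });
}
```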
Durable Objects would give stronger consistency but I did not use them.
For this site, the counters are not the final safety mechanism. They are telemetry. Eventual consistency is fine for coarse signals. It is not fine for the only budget guardrail.
When to move to DOs? I have a predefined migration trigger. Request rate is easy to see from Worker analytics. Counter drift would have to be measured by reconciliation: compare KV counters against provider usage or request logs. If reconciliation shows KV estimates drifting materially from provider-reported usage, move counters to Durable Objects.
Provider strategy
I've found free-tier AI providers on OpenRouter to be unreliable, so paid inference is the fallback. Paid inference also means an especially viral day could push AI spend past what I can afford; OpenRouter's daily spend caps are the backstop there. Of course, a determined attacker could still burn through the daily budget and push the site into degraded mode.
Degraded mode UX
When all selected providers fail or I've exhausted my daily budget for inference, the page does not show a generic error. It surfaces a few cached answers as clickable suggestions and shows a retry timer.
The retry timer backs off:
10s → 30s → 2m → 5m
If an upstream provider sends a Retry-After header, the UI honors it.
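A sketch of that client-side schedule; the function name and the way the header value reaches the client are assumptions:

```ts
// Client-side retry schedule for degraded mode. A relayed upstream
// Retry-After wins; otherwise walk the fixed backoff steps.
const BACKOFF_SECONDS = [10, 30, 120, 300]; // 10s -> 30s -> 2m -> 5m

function nextRetrySeconds(attempt: number, retryAfterHeader: string | null): number {
  // Assumes Retry-After carries a seconds value, not an HTTP-date.
  const retryAfter = retryAfterHeader ? parseInt(retryAfterHeader, 10) : NaN;
  if (!Number.isNaN(retryAfter) && retryAfter > 0) return retryAfter;
  return BACKOFF_SECONDS[Math.min(attempt, BACKOFF_SECONDS.length - 1)];
}
```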
This turns an outage into something closer to discovery. The user came for wrong answers. Cached wrong answers are still useful product surface.
This may be my favorite second-order effect in the project.
What I would change if traffic grew
I would shore up the pressure points:
- Move counters from KV to Durable Objects
- Add paid Cloudflare rate limiting
- Add better cache moderation and purge tooling
- Add model and prompt version dashboards
- Add better observability around provider failure modes
The one thing that stays fixed: GET remains cache-only and POST remains the only inference path.
If you are building public AI endpoints, I am especially interested in where you draw the line between “cheap enough to tolerate abuse” and “serious enough to justify paid controls.”
One detail I didn't get into: how I normalize prompts before hashing them for the cache key.
Naive hashing of the raw input would mean "what is TCP," "What is TCP?", and "WHAT IS TCP" each generate a separate cache entry and a separate inference call. That's wasted spend on what's effectively the same question.
What I do: lowercase, trim whitespace, collapse internal whitespace, strip trailing punctuation. Hash the result. This catches the obvious cases.
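For concreteness, a sketch of that normalization; the exact punctuation set is illustrative:

```ts
// Prompt normalization before hashing: lowercase, trim, collapse internal
// whitespace, strip trailing punctuation. Deliberately nothing semantic.
function normalizePrompt(raw: string): string {
  return raw
    .toLowerCase()
    .trim()
    .replace(/\s+/g, " ")        // collapse runs of whitespace
    .replace(/[?!.,;:]+$/, "");  // strip trailing punctuation
}

// normalizePrompt("What is TCP?") === normalizePrompt("  what is tcp ")  // "what is tcp"
```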
What I don't do: handle synonyms, semantic similarity, or typo correction. "What is TCP" and "What is the Transmission Control Protocol" hit different cache keys even though a human would treat them as the same question. Adding semantic similarity checks would mean computing embeddings on every miss, which adds cost and complexity that isn't worth it for a parody site.
The tradeoff: I serve more inferences than the absolute minimum, but the normalization layer stays simple and fast.
Curious how others handle this.