TL;DR: A customer asked my AI sales bot "what do you have?" and the bot listed product categories the store doesn't sell. My instinct was to rewrit...
For further actions, you may consider blocking this person and/or reporting abuse
This is the single most under-appreciated debugging skill for LLM systems: before you touch any code, point at the exact span of text the model is reading and ask whether you would draw the same conclusion from it. I lost a full afternoon last month on a similar one — a system prompt that said "Customer:" before each turn, then later "User profile:" in front of metadata. The model started attributing the user's bio to the customer's intent. Same shape as your
Store:label collision.The label-as-contract framing is gold. We started running a habit where every prompt field that touches the model has to answer two questions explicitly: what kind of thing is this (catalog, voice, history, hint), and what is the model forbidden to do with it. It feels heavy at first but it kills exactly this category of bug. Curious whether you've added eval cases for the regression — once you've seen a "suits" hallucination once, you basically have a free seed for an adversarial test set.
The "Customer:" / "User profile:" collision is exactly the same shape — different surface, identical bug. Two labeled fields competing for the model's attention, the more specific frame loses. Adding it to my mental catalog.
The two-question contract is the part that generalizes hardest. My article's step 2 ("look at the label that introduces the span") was diagnostic — it tells you where to look once the bug exists. Your contract is prescriptive — it tells you what to write so the bug never gets written in. The "what kind of thing is this" half is what I patched with the rename. The "what is the model forbidden to do with it" half is what I didn't have, and that's the half that kills the bug class, because it forces you to articulate the failure mode at the moment of injection, when the labeling decision is still hot. Xidao's identity-vs-reference section split (thread above) is the structural twin: sections give you the slots, contracts give you the spec for what each slot is allowed to mean.
On eval cases: you're right, and I owe my future self this. No harness yet. But I think the seed isn't literally "suits" — it's the shape "marketing prose containing noun-list category language injected into a labeled field." Generalize that and you have a test class, not a test case. The wrinkle (chewing on it in Xidao's thread) is that the adversarial inputs aren't from attackers — they're from merchants writing well-intentioned SEO copy. The accidentally-adversarial shape is the production distribution, which makes synthetic generation harder than it looks. The real test set has to be drawn from real tenant data.
This is a really great debugging story and highlights something I've run into as well — the semantic weight of variable labels in system prompts is massively underestimated.
I had a similar experience where I was injecting user profile data with a label like
User preferences: ...and the model started treating it as hard constraints rather than soft hints. The fix was almost identical to yours: relabeling it to something that communicated the right level of authority.One pattern I've started using is a kind of "role annotation" for each injected context block — instead of just
Store:, I use something like[CONTEXT: brand description, not inventory]or[GROUNDING: product table from database]. It adds a few tokens but makes the intended interpretation much more explicit. Some models handle this better than others, though — I've noticed Claude tends to respect these annotations more reliably than some open-source models.The broader lesson here is that prompt debugging really is a different discipline from code debugging. When the output is wrong, the instinct is to look at the architecture. But sometimes the architecture is fine and the "configuration" (i.e., the prompt) is the bug. Thanks for sharing this — it's a great case study for anyone building grounded LLM systems.
The bracketed role annotation is the next layer down from rename-the-label — it makes the role machine-readable in a way Store: never was, and gives you a hook for tooling later (grep prompts for [GROUNDING:] and verify each one is sourced from a real retrieval call).
The Claude-vs-open-source observation tracks. My working hypothesis: instruction-following on negative constraints ("this is NOT inventory") correlates with the depth of RLHF on multi-turn adherence. Older open-source models read [CONTEXT: not inventory] and still happily pull inventory from it because the negative scope blurs into general context. Frontier models hold the boundary more reliably. Curious whether you've seen the delta widen on long context (10K+ tokens), where I'd expect even the well-tuned models to start leaking.
On the closer — "prompt debugging is a different discipline" — agreed, with one refinement: it's a discipline that includes code debugging at the seam where the prompt is assembled. The bug here lived in TypeScript — a template literal with a vague label. The model just surfaced it. Code review doesn't catch "this label is too ambiguous" because there's no compile error. The bracketed annotations are exactly the move that makes the ambiguity inspectable.
This is a great writeup of a class of bug that is really underappreciated in LLM-powered products. The variable labeling issue you describe — where the model treats any labeled context as structured truth — is something I have hit multiple times in multi-tenant setups.
The fix of renaming the label to explicitly disclaim its purpose is clever and pragmatic. One additional pattern I have found helpful is splitting system context into two clearly separated sections: one for "identity and rules" (always authoritative) and one for "reference data" (explicitly marked as non-authoritative background). Even then, some models will still occasionally blur the boundary, so I tend to add a lightweight output validation layer as a second line of defense.
Do you have a set of adversarial test cases for this kind of context injection ambiguity, or was this caught purely through manual testing? Curious how you approach regression testing for prompt-level bugs like this.
The identity-and-rules vs reference-data split is cleaner than what I did. I fixed at the label level; you're describing the fix at the section level. The label rename worked because there was one offending injection. The section split is what survives once you're injecting five or six context blocks and any of them could quietly read as authoritative.
Honest answer on testing: this was caught manually — and only because I'd developed a habit of reading the literal prompt the model sees before touching the architecture. The "CRITICAL: list ONLY categories from search results" rule is the regression patch, but it's a static guard in the prompt, not a test suite. No adversarial harness yet. Closest thing is dogfooding against real merchant descriptions in dev mode, which catches some of these and obviously misses the long tail.
One wrinkle I keep getting stuck on with adversarial testing in multi-tenant: the reference data isn't mine. It's whatever the merchant types into their description field. Even with a clean section split, the words in the reference section can be accidentally adversarial — a merchant writes "we have everything from suits to shoes" as marketing copy and the model reads it as inventory. The injection isn't malicious; it's shaped like inventory because Shopify-style descriptions tend to be.
How do you build the adversarial set when the "adversaries" are also your tenants providing well-intentioned data? Pull from real tenant descriptions, or synthetic generation?
The framing — "architecture builds the truth, prompt decides whether the model believes it" — is exactly right. It points at a deeper problem: prompt labels are runtime metadata that competes with data, and label ambiguity is invisible until the model makes the wrong choice. When a domain object defines its own typed fields (name, price, category) rather than injecting a freeform description block, the schema itself becomes the epistemic frame — there's no ambiguity because the structure already defines what each value is. Your diagnostic question ("where in the bytes sent to the model does the wrong answer originate?") is solid; the next level is pushing role-definition upstream into the data model so you never need to ask it at debug time. That's the principle exomodel.ai is built on — schema as ground truth, not prompt labels.
Schema-as-epistemic-frame is right in pure form — typed fields don't need labels because the structure already declares the role. The wrinkle is that the bug class doesn't live inside the schema; it lives at the seam where freeform text fields get injected into prompts.
store.description in my case is a typed TEXT field. The schema knows it's a description. The model doesn't — once the value gets stringified into a prompt, it's just prose with whatever label you wrap it in. Same shape reappears with product descriptions, return policies, FAQ entries — anywhere a user supplies freeform text that has to be surfaced to the model. You can push role-definition upstream right up to the boundary where prose enters the prompt; past that point, you're labeling again.
So less "schema vs prompt labels" and more "schema decides what fields exist; labels decide how their values are framed when they cross into the model's context." Both layers, both load-bearing.
Fair refinement — and it sharpens the boundary precisely. Schema decides structure; labels decide framing at injection time. Both load-bearing, agreed.
Where it gets interesting: the seam you're describing (typed field → stringified prose → model has no idea what it is) is really a serialization problem. The schema knows store.description is a description; the serializer forgets it. So the question becomes whether the serialization layer can carry that semantic signal forward — not just the value, but its role.
One approach: the field itself participates in prompt construction based on its own definition, not a wrapper label applied externally at call time. That compresses the problem — when something breaks, you fix the field declaration, not the template.
The layer I underplayed earlier: you can also attach documents directly to the object — RAG scoped to the model itself, not bolted onto the prompt pipeline. For freeform fields specifically, that's where domain-specific grounding lives: not a label that says "this is a description," but actual domain knowledge that informs how the value should be interpreted. Schema defines structure; attached corpus defines epistemic context. Both layers, yes — but the second doesn't have to be a prompt label.
That's the cleaner frame. The bug really is a serialization problem — the schema knows the role of every field; the template forgets and has to be told again by hand. Every "Description:" label in the prompt is a manual recovery of information the type system already had.
Field-level prompt participation is the right architectural shift. The field knows what it is. The field generates its own framing. The bug class I hit becomes a schema validation problem, not a "did I label this right in the template" problem. That moves the failure surface from "across N prompts" to "in one field declaration."
Where I want to push on the freeform case: attached corpus solves grounding — the model has authoritative product data to reason against. But it doesn't solve which fields the model should ground responses in. A store description is structured by schema, but isn't corpus, and isn't authoritative ground truth either. It's user-authored marketing prose the model has no signal to discount.
The thing I fixed at the prompt level was epistemic status, not just framing. The label I added was effectively "treat this field as decorative; do not infer product claims from it." That's a field-level declaration the schema should be carrying — neither pure structure nor attached corpus, but adjacent to both.
Three layers, then, if I'm following the argument: schema declares structure, attached corpus declares domain context, and the schema also needs to declare epistemic status per field — authoritative vs decorative. The bug I hit was the model treating a decorative field as authoritative. Attached corpus would give it better ground truth, but wouldn't tell it which field NOT to ground from.
This is the most underrated debugging lesson in LLM systems: "infra-looking" symptoms are almost always prompt or schema issues in disguise. Every time we've been tempted to rip out an LLM router or vector store, a postmortem found the real cause in the prompt contract — ambiguous role descriptions, conflicting few-shots, or an instruction that quietly contradicted a tool spec. Two practices that save us hours now: (1) golden traces — store the exact (prompt, model, params, response) for every failure class and diff against current behavior on each prompt change, and (2) treat prompts as code with versioning, code review, and CI evals. The two-line fix you found will reappear in someone else's system next quarter if it's not enforced structurally.
"infra-looking symptoms are almost always prompt or schema issues in disguise" — stealing that as the article's missing TL;DR.
What I'm taking from your comment: the 4-step checklist I ended on is the diagnostic layer. It catches the bug when you're staring at it. Golden traces + prompts-as-code is the prevention layer — it stops the next mislabel from reaching prod at all. They stack rather than compete. The article only covers diagnostics because that's where I currently am in my practice. Every postmortem I've published on Dev.to is essentially a failure-class snapshot in prose form — the manual version of what you've automated.
One thing I keep getting stuck on with eval-on-prompt-change for grounded LLMs: what's the diff criterion? Exact response match is too brittle (temperature noise, paraphrase drift). Structural match — "categories mentioned, tools called, hallucinated entities = 0" — feels closer, but it's harder to define generically across domains. How do you draw the line in your golden traces?
The trick is not picking one criterion — it's splitting the trace into three things and diffing each differently.
Deterministic structure (tool name, arguments, schema, retrieved IDs) > exact match. Most regressions land here before prose even matters.
Grounded content > set-membership against retrieval, not against the golden response. Your suits bug is a one-liner: mentioned_categories ⊆ retrieved_categories. Six to ten invariants like that probably cover most of a catalog product's failure surface.
Prose > rubric-scored judge model as a soft signal, never a gate. Skip embedding cosine; "suits" and "shirts" are neighbors in vector space, which is exactly the distance you need to resolve.
One thing I'd push on regardless: don't diff a single sample. Sample N at prod temperature and track invariant pass rate as a distribution a prompt edit that moves hallucination from 1% to 6% looks clean on any single diff.
Your checklist already encodes this; CI just needs the span roles (catalog / voice / history / hint).
mentioned_categories ⊆ retrieved_categories is the article's diagnostic step rewritten as a one-line invariant — exactly the upgrade from "where did the bytes come from" to "is the output bounded by valid sources." Banking that.
The cosine point is the one that subverts a default I would have reached for. The whole bug class lives at distances embedding similarity treats as "the same" — "suits" and "shirts" are neighbors in vector space because they're neighbors in the world, and that's exactly when the hallucination is most expensive to catch. Rubric-scored judge as soft signal is going on my list.
Distribution-over-samples is the one I haven't lived yet. Provia isn't in front of real merchants at scale, so today's "regression surface" is dogfood plus a small set of seed stores. The 1%-to-6% drift you're describing is invisible from where I'm standing — and that changes the moment the first paying tenant is on, which is when this stops being an article I'm bookmarking and starts being infrastructure I have to actually build.
Reading you next to Xidao's section split and Max's label contracts (other threads here), the convergence is striking — span roles, sections, contracts. Three different vocabularies for the same primitive: every piece of context the model sees needs an explicit role and an explicit forbidden-action. Your CI version is what enforces it at scale; the prompt-level version is what gets you to the starting line.