I've built the same thing more than once: a step that reads an inbound
message — a lead form, a support ticket, a DM — and decides what to do with
it. Qualify it, escalate it, route it, drop it.
Every time, the implementation had the same shape: hand-write a prompt that
asks an LLM to return JSON, parse the JSON, branch on it. And every time it
rotted the same way:
- The prompt was untestable. "Looks right" was the only QA.
- It drifted. A model upgrade or a one-word prompt tweak silently changed the output and nobody noticed until something got misrouted.
- The JSON lied. The model would confidently return a category that wasn't in my allowed set, or a number outside the range I expected, and my downstream switch would happily act on garbage.
- "Confidence" was theater. if confidence > 0.7 is a magic number that means nothing across different inputs.
Eventually I stopped writing prompts. This is what I built instead and what I
learned.
The core idea: declare what to detect, not how to ask
Instead of a prompt, you declare typed detectors. Two kinds:
- presence: "is this signal in the text?" → returns found
- classification: "put this into exactly one of a fixed set of categories" → returns the category, enum-validated
{
  "name": "Partnership routing",
  "template": "custom",
  "custom_detectors": [
    {
      "name": "competitor_mentioned",
      "type": "presence",
      "examples": ["we're currently using Acme", "switching off a competitor"],
      "non_examples": ["love your product"]
    },
    {
      "name": "partnership_inquiry",
      "type": "classification",
      "categories": ["reseller", "affiliate", "strategic", "none"],
      "examples": [
        "interested in your reseller program",
        "want to co-sell with you",
        "just a support question"
      ]
    }
  ]
}
You never see the prompt — it's compiled from the declaration. That part isn't
the interesting bit; anyone can template a prompt. The interesting bit is
everything that becomes possible because a detector is a typed object instead
of a string.
1. Detectors are tested at create-time, not in prod
Every detector requires at least one positive example. When you create the
evaluation, those examples run as a smoke test: each positive must actually
match, and a classification example must land inside its declared
categories. If it doesn't, creation fails.
This is the part I wish I'd had years ago. A prompt can be syntactically fine
and semantically broken, and you find out in production. Here, a broken detector
can't ship — the assertion runs before it's ever live. (non_examples are
presence-only, because a classification detector always lands somewhere, so
there's no "not found" state to assert.)
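To make the create-time check concrete, here is a minimal sketch of a smoke-test harness. It is not the actual implementation: the `run` callable stands in for the compiled-prompt model call, and its result shape (`{"found": ...}` or `{"category": ...}`) is an assumption, as is the toy `keyword_runner` used to exercise it.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the model call: takes a detector and a text and
# returns {"found": bool} for presence or {"category": str} for classification.
Runner = Callable[[Dict, str], Dict]


def smoke_test(detector: Dict, run: Runner) -> List[str]:
    """Return a list of failures; an empty list means creation may proceed."""
    failures = []
    examples = detector.get("examples", [])
    if not examples:
        failures.append("at least one positive example is required")
    for text in examples:
        result = run(detector, text)
        if detector["type"] == "presence" and not result.get("found"):
            failures.append(f"positive example did not match: {text!r}")
        if detector["type"] == "classification":
            if result.get("category") not in detector["categories"]:
                failures.append(f"example landed outside categories: {text!r}")
    # non_examples are presence-only; assumed here to assert "not found".
    if detector["type"] == "presence":
        for text in detector.get("non_examples", []):
            if run(detector, text).get("found"):
                failures.append(f"non-example matched: {text!r}")
    return failures


def keyword_runner(detector: Dict, text: str) -> Dict:
    """Toy runner that only exercises the harness; it is not a real model."""
    lowered = text.lower()
    if detector["type"] == "presence":
        return {"found": "acme" in lowered or "competitor" in lowered}
    return {"category": "reseller" if "reseller" in lowered else "none"}
```

Creation would then be refused whenever `smoke_test` returns a non-empty list, which is what keeps a semantically broken detector from going live.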
2. The output is validated deterministically, not trusted
The LLM proposes; deterministic code disposes. The validators are boring on
purpose: present, range:0-100, enum:yes,no. An out-of-set classification
doesn't get to pass — it's coerced to not-found and surfaced in an
invalid_fields list so you can see the model misbehaved instead of silently
acting on it.
This is the line I'd defend hardest: structured outputs / function calling get
you a valid shape. They don't get you a checked value. A schema says "this
is a string from a set"; it doesn't run your range check or tell you the model
went off-menu.
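A rough sketch of how validators like these might work, assuming the rule strings shown above (`present`, `range:0-100`, `enum:yes,no`). The function names and the choice to represent not-found as `None` are illustrative assumptions, not the real API.

```python
from typing import Any, Dict, List, Tuple


def check(rule: str, value: Any) -> bool:
    """Apply one rule of the form 'present', 'range:LO-HI' or 'enum:a,b'."""
    if rule == "present":
        return value is not None and value != ""
    kind, _, arg = rule.partition(":")
    if kind == "range":
        # Simplification: only non-negative bounds such as "0-100" are parsed.
        lo, hi = (float(x) for x in arg.split("-"))
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and lo <= value <= hi
    if kind == "enum":
        return value in arg.split(",")
    raise ValueError(f"unknown validator: {rule}")


def validate(
    raw: Dict[str, Any], rules: Dict[str, str]
) -> Tuple[Dict[str, Any], List[str]]:
    """Keep values that pass; coerce failures to None and record the field."""
    signals: Dict[str, Any] = {}
    invalid_fields: List[str] = []
    for field, rule in rules.items():
        value = raw.get(field)
        if check(rule, value):
            signals[field] = value
        else:
            signals[field] = None  # treated as not-found downstream
            # Assumption: a missing value is plain not-found; only a value the
            # model actually returned and that failed the check is "invalid".
            if value is not None:
                invalid_fields.append(field)
    return signals, invalid_fields
```

With this shape, a model answer of `"franchise"` for a detector limited to `reseller,affiliate,strategic,none` never reaches the branch logic: the signal comes back as `None` and the field name shows up in `invalid_fields`.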
3. Escalation is rules, not a confidence threshold
Escalation is separate from the model's self-reported confidence. You write
triggers on extracted values:
- a classification trigger fires when the value is in a declared set
- a presence trigger fires when the detector is found
- required: true triggers are ANDed; required: false triggers are ORed
So "escalate if it's a strategic partnership AND a competitor is mentioned" is
expressible and deterministic. No magic 0.7.
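The trigger rules above can be sketched in a few lines. The field names (`detector`, `type`, `values`, `required`) are hypothetical, and the post does not say how the required and optional groups combine; this sketch assumes escalation fires when all required triggers fire or when any optional trigger fires.

```python
from typing import Any, Dict, List


def trigger_fires(trigger: Dict[str, Any], signals: Dict[str, Any]) -> bool:
    """Evaluate one trigger against already-validated extracted signals."""
    value = signals.get(trigger["detector"])
    if trigger["type"] == "presence":
        return value is True
    if trigger["type"] == "classification":
        return value in trigger["values"]
    raise ValueError(f"unknown trigger type: {trigger['type']}")


def should_escalate(triggers: List[Dict[str, Any]], signals: Dict[str, Any]) -> bool:
    """Required triggers are ANDed together; optional triggers are ORed."""
    required = [t for t in triggers if t.get("required", False)]
    optional = [t for t in triggers if not t.get("required", False)]
    # An empty required group must not escalate by itself (all([]) is True).
    all_required = bool(required) and all(trigger_fires(t, signals) for t in required)
    any_optional = any(trigger_fires(t, signals) for t in optional)
    # Assumption: either group is sufficient to escalate.
    return all_required or any_optional
```

The "strategic partnership AND competitor mentioned" rule is then two triggers with `required` set to true, and the decision depends only on extracted values, never on a confidence score.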
4. One call, structured decision out
POST /v1/evaluate → { status, extracted_signals, next_action }
status is one of QUALIFIED / PARTIAL / FAILED / ESCALATE. That's the whole
point: the thing my switch branches on is a small closed enum, not free text I
have to parse and pray over. It drops straight into n8n/Zapier/Make. You can
also POST real outcomes back later (converted? deal value? days to close?) so
the rubric can be measured against reality instead of vibes.
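On the consuming side, branching on a closed enum might look like the sketch below. The four status values come from the response shape above; the route names are made up for illustration, and the HTTP call itself is left out so the example stays self-contained.

```python
from enum import Enum
from typing import Any, Dict


class Status(str, Enum):
    QUALIFIED = "QUALIFIED"
    PARTIAL = "PARTIAL"
    FAILED = "FAILED"
    ESCALATE = "ESCALATE"


# Hypothetical downstream routes; replace with whatever your workflow does.
ROUTES = {
    Status.QUALIFIED: "send_to_sales",
    Status.PARTIAL: "request_more_info",
    Status.FAILED: "drop",
    Status.ESCALATE: "page_human",
}


def route(response: Dict[str, Any]) -> str:
    """Map an evaluation response to a route, rejecting unknown statuses."""
    # Status(...) raises ValueError for anything outside the four values,
    # so an unexpected status fails loudly instead of being misrouted.
    status = Status(response["status"])
    return ROUTES[status]
```

For example, `route({"status": "ESCALATE", "extracted_signals": {}, "next_action": None})` returns `"page_human"`, while an unrecognized status raises instead of falling through a default branch.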
What I got wrong / what's still ugly
Being honest, because these are real:
- Everything hits the LLM today. Even an obvious keyword goes through a model call. The plan is a pre-LLM pattern extractor so deterministic signals never pay for inference — not built yet. So cost/latency is "one batched LLM call per eval": fine for inbound webhooks, not for high-QPS streams.
- It's synchronous. Long transcripts are slow; I prepend a structured header (severity/tier) instead of dumping a 5k-token thread.
- Batching detectors into one prompt keeps cost down but lets one detector's phrasing bleed into another's extraction. Isolating them costs N calls. I chose cost; not sure it's right.
- Multi-turn is naive. Re-evaluating a growing conversation re-sends the whole thing. Delta prompts are on the list.
The question I actually have
Is "declare + validate + smoke-test" the right altitude? Or do people doing this
seriously want prompt-level control and would find the abstraction a cage the
first time they hit an edge case?
My bet: for the 80% case — lead qual, ticket triage, intent on inbound — nobody
should be hand-maintaining a classification prompt, the same way nobody
hand-writes a query planner. But I've been wrong about abstractions before.
Curious where this breaks for you.
I packaged this up as the evaluation API behind EchoStack: you can run an evaluation on your own text in the demo (no signup) or skim the API quickstart.