Stop writing prompts to classify text: make evaluation declarative

#typescript #ai #softwareengineering #llm

I've built the same thing more than once: a step that reads an inbound
message — a lead form, a support ticket, a DM — and decides what to do with
it. Qualify it, escalate it, route it, drop it.

Every time, the implementation had the same shape: hand-write a prompt that
asks an LLM to return JSON, parse the JSON, branch on it. And every time it
rotted the same way:

The prompt was untestable. "Looks right" was the only QA.
It drifted. A model upgrade or a one-word prompt tweak silently changed the output and nobody noticed until something got misrouted.
The JSON lied. The model would confidently return a category that wasn't in my allowed set, or a number outside the range I expected, and my downstream switch would happily act on garbage.
"Confidence" was theater. A check like if confidence > 0.7 rests on a magic number that means nothing across different inputs.

Eventually I stopped writing prompts. This is what I built instead and what I
learned.

The core idea: declare what to detect, not how to ask

Instead of a prompt, you declare typed detectors. Two kinds:

presence — "is this signal in the text?" → returns found
classification — "put this into exactly one of a fixed set of categories" → returns the category, enum-validated

{
  "name": "Partnership routing",
  "template": "custom",
  "custom_detectors": [
    {
      "name": "competitor_mentioned",
      "type": "presence",
      "examples": ["we're currently using Acme", "switching off a competitor"],
      "non_examples": ["love your product"]
    },
    {
      "name": "partnership_inquiry",
      "type": "classification",
      "categories": ["reseller", "affiliate", "strategic", "none"],
      "examples": [
        "interested in your reseller program",
        "want to co-sell with you",
        "just a support question"
      ]
    }
  ]
}

You never see the prompt — it's compiled from the declaration. That part isn't
the interesting bit; anyone can template a prompt. The interesting bit is
everything that becomes possible because a detector is a typed object instead
of a string.
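
To make "typed object instead of a string" concrete, here is a minimal sketch of how the declaration above could be modeled in TypeScript. The field names come from the JSON example; the type names and the lintDetector helper are hypothetical, not part of any published API.

```typescript
// Hypothetical types mirroring the JSON declaration; only the field names
// are taken from the example, everything else is an assumption.
type PresenceDetector = {
  name: string;
  type: "presence";
  examples: string[];
  non_examples?: string[];
};

type ClassificationDetector = {
  name: string;
  type: "classification";
  categories: string[];
  examples: string[];
};

type Detector = PresenceDetector | ClassificationDetector;

// Structural checks that need no model call at all.
function lintDetector(d: Detector): string[] {
  const problems: string[] = [];
  if (d.examples.length === 0) {
    problems.push(`${d.name}: needs at least one positive example`);
  }
  if (d.type === "classification") {
    if (d.categories.length === 0) {
      problems.push(`${d.name}: needs at least one category`);
    }
    if (new Set(d.categories).size !== d.categories.length) {
      problems.push(`${d.name}: duplicate categories`);
    }
  }
  return problems;
}
```

Because type is a discriminant, the compiler rejects a presence detector that declares categories, which is a check a prompt string can never give you.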

1. Detectors are tested at create-time, not in prod

Every detector requires at least one positive example. When you create the
evaluation, those examples run as a smoke test: each positive must actually
match, and a classification example must land inside its declared
categories. If it doesn't, creation fails.

This is the part I wish I'd had years ago. A prompt can be syntactically fine
and semantically broken, and you find out in production. Here, a broken detector
can't ship — the assertion runs before it's ever live. (non_examples are
presence-only, because a classification detector always lands somewhere, so
there's no "not found" state to assert.)
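
The create-time check described above could look roughly like this. It is a sketch under assumptions: RunDetector stands in for the compiled-prompt model call, and the Extraction shape is made up for illustration. The assertion itself is kept pure so it can be unit-tested without a model.

```typescript
type Detector =
  | { name: string; type: "presence"; examples: string[]; non_examples?: string[] }
  | { name: string; type: "classification"; categories: string[]; examples: string[] };

// Assumed shape of one detector result; not the real API's response.
type Extraction = { found: boolean; category?: string };

// Stand-in for the model call (hypothetical signature).
type RunDetector = (d: Detector, text: string) => Promise<Extraction>;

// Pure assertion for one example. Returns a failure message, or null on pass.
function checkExample(
  d: Detector,
  text: string,
  positive: boolean,
  r: Extraction,
): string | null {
  if (d.type === "presence") {
    return r.found === positive
      ? null
      : `${d.name}: expected found=${positive} for "${text}"`;
  }
  // Classification always lands somewhere, so only the category is checked.
  const ok = r.category !== undefined && d.categories.includes(r.category);
  return ok ? null : `${d.name}: "${text}" landed outside the declared categories`;
}

// Runs every example through the model and throws if any assertion fails,
// which is what would make creation fail.
async function smokeTest(d: Detector, run: RunDetector): Promise<void> {
  const cases: Array<[string, boolean]> = d.examples.map(
    (t): [string, boolean] => [t, true],
  );
  if (d.type === "presence") {
    for (const t of d.non_examples ?? []) cases.push([t, false]);
  }
  const failures: string[] = [];
  for (const [text, positive] of cases) {
    const failure = checkExample(d, text, positive, await run(d, text));
    if (failure !== null) failures.push(failure);
  }
  if (failures.length > 0) {
    throw new Error(`smoke test failed:\n${failures.join("\n")}`);
  }
}
```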

2. The output is validated deterministically, not trusted

The LLM proposes; deterministic code disposes. The validators are boring on
purpose: present, range:0-100, enum:yes,no. An out-of-set classification
doesn't get to pass — it's coerced to not-found and surfaced in an
invalid_fields list so you can see the model misbehaved instead of silently
acting on it.

This is the line I'd defend hardest: structured outputs / function calling get
you a valid shape. They don't get you a checked value. A schema says "this
is a string from a set"; it doesn't run your range check or tell you the model
went off-menu.
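
A minimal sketch of that "propose, then dispose" step, assuming the validator strings are written exactly as above (present, range:0-100, enum:yes,no). The parsing rules, the null-for-not-found convention, and the choice to list a field in invalid_fields only when the model returned something that failed are my assumptions.

```typescript
type Rule =
  | { kind: "present" }
  | { kind: "range"; min: number; max: number }
  | { kind: "enum"; allowed: string[] };

// Parses specs like "present", "range:0-100", "enum:yes,no".
function parseRule(spec: string): Rule {
  if (spec === "present") return { kind: "present" };
  const range = /^range:(-?\d+(?:\.\d+)?)-(-?\d+(?:\.\d+)?)$/.exec(spec);
  if (range) return { kind: "range", min: Number(range[1]), max: Number(range[2]) };
  if (spec.startsWith("enum:")) {
    return { kind: "enum", allowed: spec.slice("enum:".length).split(",") };
  }
  throw new Error(`unknown validator: ${spec}`);
}

function passes(rule: Rule, value: unknown): boolean {
  switch (rule.kind) {
    case "present":
      return value !== null && value !== undefined && value !== "";
    case "range":
      return typeof value === "number" && value >= rule.min && value <= rule.max;
    case "enum":
      return typeof value === "string" && rule.allowed.includes(value);
  }
}

// Checks the model's raw output against the declared rules. Values that fail
// are coerced to null (not-found) and reported instead of being passed along.
function validate(
  raw: Record<string, unknown>,
  specs: Record<string, string>,
): { signals: Record<string, unknown>; invalid_fields: string[] } {
  const signals: Record<string, unknown> = {};
  const invalid_fields: string[] = [];
  for (const [field, spec] of Object.entries(specs)) {
    const value = raw[field];
    if (passes(parseRule(spec), value)) {
      signals[field] = value;
    } else {
      signals[field] = null;
      // Only flag fields where the model actually returned something bad.
      if (value !== undefined && value !== null) invalid_fields.push(field);
    }
  }
  return { signals, invalid_fields };
}
```

None of this depends on the model, so it can be tested like any other pure function.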

3. Escalation is rules, not a confidence threshold

Escalation is separate from the model's self-reported confidence. You write
triggers on extracted values:

a classification trigger fires when the value is in a declared set
a presence trigger fires when the detector is found
triggers marked required: true are ANDed together; triggers marked required: false are ORed

So "escalate if it's a strategic partnership AND a competitor is mentioned" is
expressible and deterministic. No magic 0.7.
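
Here is one way those trigger rules could be evaluated. The Trigger and Signals shapes are hypothetical, and the way the two groups combine (escalate when every required trigger fires, or when any optional trigger fires) is my assumption; the rules above only say how triggers combine within each group.

```typescript
// Assumed shape of validated detector output, keyed by detector name.
type Signals = Record<string, { found: boolean; value?: string }>;

type Trigger =
  | { detector: string; type: "presence"; required: boolean }
  | { detector: string; type: "classification"; values: string[]; required: boolean };

// A presence trigger fires when the detector is found; a classification
// trigger fires when the extracted value is in the declared set.
function fires(t: Trigger, signals: Signals): boolean {
  const sig = signals[t.detector];
  if (sig === undefined || !sig.found) return false;
  if (t.type === "presence") return true;
  return sig.value !== undefined && t.values.includes(sig.value);
}

function shouldEscalate(triggers: Trigger[], signals: Signals): boolean {
  const required = triggers.filter((t) => t.required);
  const optional = triggers.filter((t) => !t.required);
  // ASSUMPTION: an empty required group does not escalate by itself.
  const allRequired = required.length > 0 && required.every((t) => fires(t, signals));
  const anyOptional = optional.some((t) => fires(t, signals));
  return allRequired || anyOptional;
}

// "Escalate if it's a strategic partnership AND a competitor is mentioned."
const partnershipTriggers: Trigger[] = [
  { detector: "partnership_inquiry", type: "classification", values: ["strategic"], required: true },
  { detector: "competitor_mentioned", type: "presence", required: true },
];
```

The same input always produces the same decision, and no threshold has to be tuned.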

4. One call, structured decision out

POST /v1/evaluate → { status, extracted_signals, next_action }

status is one of QUALIFIED / PARTIAL / FAILED / ESCALATE. That's the whole
point: the thing my switch branches on is a small closed enum, not free text I
have to parse and pray over. It drops straight into n8n/Zapier/Make. You can
also POST real outcomes back later (converted? deal value? days to close?) so
the rubric can be measured against reality instead of vibes.
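
On the consuming side, branching on a closed enum can be sketched like this. The three response fields and the four status values come from the text above; the shapes of extracted_signals and next_action, and the action names returned by route, are assumptions made for illustration.

```typescript
const STATUSES = ["QUALIFIED", "PARTIAL", "FAILED", "ESCALATE"] as const;
type Status = (typeof STATUSES)[number];

type EvaluateResponse = {
  status: Status;
  extracted_signals: Record<string, unknown>; // assumed shape
  next_action: unknown; // shape not specified in this post
};

// Narrows an untyped JSON body (for example the parsed response of
// POST /v1/evaluate) and rejects anything outside the closed enum.
function parseEvaluateResponse(body: unknown): EvaluateResponse {
  if (typeof body !== "object" || body === null) {
    throw new Error("response is not an object");
  }
  const b = body as Record<string, unknown>;
  const status = STATUSES.find((s) => s === b["status"]);
  if (status === undefined) {
    throw new Error(`unexpected status: ${String(b["status"])}`);
  }
  const signals = b["extracted_signals"];
  if (typeof signals !== "object" || signals === null) {
    throw new Error("missing extracted_signals");
  }
  return {
    status,
    extracted_signals: signals as Record<string, unknown>,
    next_action: b["next_action"],
  };
}

// Hypothetical downstream actions; the compiler enforces exhaustiveness.
function route(status: Status): string {
  switch (status) {
    case "QUALIFIED":
      return "create_deal";
    case "PARTIAL":
      return "ask_follow_up";
    case "FAILED":
      return "drop";
    case "ESCALATE":
      return "page_human";
    default: {
      const unreachable: never = status;
      throw new Error(`unhandled status: ${String(unreachable)}`);
    }
  }
}
```

If a fifth status is ever added to STATUSES, route stops compiling until it gets a branch, which is the opposite of silent drift.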

What I got wrong / what's still ugly

Being honest, because these are real:

Everything hits the LLM today. Even an obvious keyword goes through a model call. The plan is a pre-LLM pattern extractor so deterministic signals never pay for inference — not built yet. So cost/latency is "one batched LLM call per eval": fine for inbound webhooks, not for high-QPS streams.
It's synchronous. Long transcripts are slow; I prepend a structured header (severity/tier) instead of dumping a 5k-token thread.
Batching detectors into one prompt keeps cost down but lets one detector's phrasing bleed into another's extraction. Isolating them costs N calls. I chose cost; not sure it's right.
Multi-turn is naive. Re-evaluating a growing conversation re-sends the whole thing. Delta prompts are on the list.
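
For the first item, a pre-LLM pass might look something like the sketch below. This is purely illustrative: the extractor isn't built, and the patterns field and prePass helper are hypothetical names, not a design commitment.

```typescript
// Hypothetical: a presence detector that also carries cheap regex patterns.
// Avoid the "g" flag here, since RegExp.test is stateful with it.
type PatternDetector = { name: string; patterns: RegExp[] };

// Resolves detectors whose pattern matches without a model call, and returns
// the names of the ones that still need inference.
function prePass(
  text: string,
  detectors: PatternDetector[],
): { resolved: Record<string, boolean>; needsModel: string[] } {
  const resolved: Record<string, boolean> = {};
  const needsModel: string[] = [];
  for (const d of detectors) {
    if (d.patterns.some((p) => p.test(text))) {
      resolved[d.name] = true;
    } else {
      // A pattern miss proves nothing, so the detector falls through to the LLM.
      needsModel.push(d.name);
    }
  }
  return { resolved, needsModel };
}
```

Only a hit is treated as deterministic here. Treating a miss as "not found" would save more calls, but it would quietly turn a keyword list into the classifier.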

The question I actually have

Is "declare + validate + smoke-test" the right altitude? Or do people doing this
seriously want prompt-level control and would find the abstraction a cage the
first time they hit an edge case?

My bet: for the 80% case — lead qual, ticket triage, intent on inbound — nobody
should be hand-maintaining a classification prompt, the same way nobody
hand-writes a query planner. But I've been wrong about abstractions before.
Curious where this breaks for you.

I packaged this up as the evaluation API behind EchoStack. You can run an
evaluation on your own text in the demo (no signup) or skim the API
quickstart.