I'm Abe, founder of OnCrew, an AI answering service for HVAC, plumbing, electrical, and roofing contractors. This is a technical write-up of how we approach the design of a phone agent that has to do trade-specific intake and triage in real time, without overpromising things that matter (like a tech's arrival time) or under-handling things that matter (like a gas-smell call at 2am).
I'll cover the architecture, the prompt structure, how we manage state across a call, what we hand off, and the boundaries we deliberately don't cross. This is meant for developers building production phone agents in adjacent verticals, not for buyers. If you want the buyer's version of this, the marketing site has it.
The shape of the problem
A residential contractor call has three phases that are easy to confuse:
- Detection — what kind of call is this? (Emergency vs routine. New customer vs existing. In-trade vs out-of-trade.)
- Intake — collect the fields the dispatcher needs to act on the ticket.
- Handoff — decide what happens after the call. Alert the on-call tech? Queue a callback? Direct to 911? Send to the CRM?
Most generic phone agents collapse these phases into a single linear script and that's where they fail. A real call doesn't go in order. The caller will mention an emergency symptom three sentences into a routine appointment request. The agent has to notice and re-classify. A linear script can't.
So the first architectural decision is to treat the call as an event stream and run classification continuously, not just at the start.
Architecture
The high-level pipeline looks like this:
caller speech -> ASR -> turn buffer -> classifier -> intent state
    -> intake policy -> response generator -> TTS -> caller
              |
              v
        action queue (alerts, webhooks, summaries)
The pieces that matter:
Turn buffer. Holds the last N utterances and the running intent state. Not just the immediate user turn — context windows of 4-8 turns let the classifier pick up emergency cues that appear mid-conversation.
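A minimal sketch of the turn buffer, assuming a simple tuple-per-utterance representation (class and method names are illustrative, not our production API):

```python
from collections import deque

class TurnBuffer:
    """Holds the last N utterances plus running context for the classifier.

    The window size is illustrative; the point is that the classifier sees
    a 4-8 turn window, not just the latest turn.
    """
    def __init__(self, max_turns=6):
        self.turns = deque(maxlen=max_turns)  # oldest turns fall off automatically

    def add(self, speaker, text):
        self.turns.append((speaker, text))

    def classifier_context(self):
        # The classifier reads the whole window, so an emergency cue
        # mentioned three sentences back still registers.
        return " ".join(text for _, text in self.turns)

buf = TurnBuffer(max_turns=4)
buf.add("caller", "Hi, I'd like to book a maintenance visit.")
buf.add("agent", "Sure, can I get your address?")
buf.add("caller", "123 Oak St. Also, I keep smelling gas near the furnace.")
print(buf.classifier_context())
```

The `deque(maxlen=...)` does the eviction for free; the real buffer also carries the intent state alongside the utterances.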
Classifier. Two-pass. Pass one is a fast model that runs on every turn and outputs a structured classification (intent, trade, urgency, out_of_scope_flags). Pass two is a slower verification model that only fires when the fast model crosses a confidence threshold for an urgent or out-of-scope classification, to avoid false-positive emergencies.
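The two-pass gating logic can be sketched like this, with the model callables stubbed out (the dict shape and threshold value are illustrative):

```python
URGENT_BANDS = {"urgent", "life-safety"}

def classify_turn(context, fast_model, verify_model, threshold=0.7):
    """Two-pass classification sketch.

    fast_model runs on every turn and returns a dict like
    {"urgency_band": ..., "confidence": ..., "out_of_scope": bool}.
    verify_model is the slower second pass; it only fires when the fast
    pass crosses the confidence threshold on an urgent or out-of-scope
    call, so false-positive emergencies get a second look.
    """
    fast = fast_model(context)
    needs_verify = fast["confidence"] >= threshold and (
        fast["urgency_band"] in URGENT_BANDS or fast["out_of_scope"]
    )
    if needs_verify:
        # The verifier sees both the context and the fast result,
        # and may downgrade the band.
        return verify_model(context, fast)
    return fast
```

In production both callables wrap LLM calls; the routine calls never pay the latency of the second pass.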
Intent state. A typed object that persists across the call. Fields include trade, service_category, urgency_band, caller_role, address, callback_number, appointment_window, access_notes, flags. The intake policy reads this and decides what to ask next.
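As a dataclass it looks roughly like this (field names follow the list above; the types and defaults are illustrative):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IntentState:
    """Typed call state, persisted across turns."""
    trade: Optional[str] = None              # "hvac" | "plumbing" | "electrical" | "roofing"
    service_category: Optional[str] = None
    urgency_band: str = "standard"
    caller_role: Optional[str] = None        # e.g. homeowner, tenant, property manager
    address: Optional[str] = None
    callback_number: Optional[str] = None
    appointment_window: Optional[str] = None
    access_notes: Optional[str] = None
    flags: list = field(default_factory=list)

    def missing(self, required):
        """Which required fields are still uncaptured? The intake
        policy reads this to decide what to ask next."""
        return [name for name in required if getattr(self, name) in (None, "")]
```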
Intake policy. A state machine that drives question selection. We deliberately don't let the LLM decide the next question in a free-form way — too easy for it to skip required fields. The policy enforces field coverage based on trade and urgency_band. The LLM only generates the natural-language form of the question.
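A stripped-down sketch of that selection step, assuming a per-(trade, urgency) required-field table (the table contents are made up for illustration):

```python
# Required fields per (trade, urgency_band). Illustrative, not exhaustive.
REQUIRED = {
    ("plumbing", "urgent"):   ["callback_number", "address", "access_notes"],
    ("plumbing", "standard"): ["callback_number", "address", "appointment_window"],
}

def next_field(state, trade, urgency_band):
    """Deterministic next-question selection. The policy, not the LLM,
    picks the field, so required fields can't be skipped and captured
    fields aren't re-asked. The LLM only phrases the question."""
    for name in REQUIRED.get((trade, urgency_band), []):
        if not state.get(name):
            return name
    return None  # field coverage complete; move to handoff
```

The LLM then gets a single instruction like "ask for the service address" rather than the whole script.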
Response generator. Generates the spoken response in a fixed conversational style. Constraints: doesn't promise ETA, doesn't quote prices, doesn't diagnose, doesn't tell the caller it's safe to do anything physical to their equipment.
Action queue. Decoupled from the call. Webhooks, alerts, CRM writes, recording finalization, and summary generation all run asynchronously: most after the call ends, time-sensitive alerts during it.
Prompt structure
The single biggest mistake I see in early-stage phone agents is one giant system prompt that tries to be a personality, a knowledge base, and an intake script all at once. We split this into three concerns:
The persona prompt. Short. Stable. Defines tone, pace, and identity ("you're answering for [Company]"). Doesn't include trade knowledge or intake fields. Doesn't change across calls.
The trade pack. Per-trade content: vocabulary, urgency cues, sample fields, sample boundary statements. HVAC, plumbing, electrical, and roofing each have their own pack. The pack is selected at the start of the call based on the caller's first utterance or the configured shop's primary trade.
The intake policy as structured instructions. Not free prose. A list of fields with their preconditions, ask-priorities, and example phrasings. This is the part that's easy to get wrong if you put it in prose, because the model will skip fields it thinks are implicit.
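Concretely, "structured instructions" means something closer to this than to prose (the field entries here are invented examples of the shape, not our real config):

```python
# Each field carries its precondition, ask-priority, and example
# phrasings the response generator can draw from.
INTAKE_FIELDS = [
    {
        "name": "callback_number",
        "ask_priority": 1,
        "precondition": None,  # always required
        "example_phrasings": ["What's the best number to reach you at?"],
    },
    {
        "name": "address",
        "ask_priority": 2,
        "precondition": None,
        "example_phrasings": ["And what's the service address?"],
    },
    {
        "name": "appointment_window",
        "ask_priority": 3,
        # Only asked on non-urgent calls; urgent calls go to on-call.
        "precondition": "urgency_band in ('informational', 'standard')",
        "example_phrasings": ["Is morning or afternoon better for you?"],
    },
]

ask_order = [f["name"] for f in sorted(INTAKE_FIELDS, key=lambda f: f["ask_priority"])]
```

Because every field is an explicit entry, "the model thought it was implicit" stops being a failure mode: coverage is checked against the list, not against the model's judgment.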
The classifier and the response generator see different subsets of this. The classifier sees the trade pack but not the persona. The response generator sees the persona and the current field-to-ask but not the full intake policy.
Urgency triage, with hard edges
This is the part I want to be careful about, because it's where overpromising can cause harm.
We pattern-match on emergency-shaped language using a layered approach:
- Lexical cues. Direct phrases that map to urgency without ambiguity: "gas smell," "sparking," "active leak," "no heat," "burst pipe," "sewage backup," "panel buzzing," "smoke," "exposed wiring," "tarp."
- Symptom cues. Combinations that imply urgency without naming it: "water everywhere," "I can hear it dripping inside the wall," "the lights flicker when I plug things in."
- Context cues. Time-of-day, weather event, caller stress level (tone, pace, interruptions).
- Vulnerability cues. "Elderly mother," "infant," "asthma," "we can't get out of the house."
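The first text-based layers reduce to substring/pattern matching before any model gets involved. A sketch, with heavily trimmed cue lists (the production sets are much larger, and context cues like time-of-day and tone come from call metadata, not the transcript):

```python
LEXICAL_CUES = {"gas smell", "sparking", "active leak", "no heat",
                "burst pipe", "sewage backup", "smoke", "exposed wiring"}
SYMPTOM_CUES = ["water everywhere", "dripping inside the wall", "lights flicker"]
VULNERABILITY_CUES = ["elderly", "infant", "asthma", "can't get out"]

def match_cues(utterance):
    """Layered cue match on one utterance; the classifier consumes
    these hits alongside its own judgment."""
    text = utterance.lower()
    return {
        "lexical": sorted(c for c in LEXICAL_CUES if c in text),
        "symptom": [c for c in SYMPTOM_CUES if c in text],
        "vulnerability": [c for c in VULNERABILITY_CUES if c in text],
    }
```

The cheap layer exists so an unambiguous phrase like "gas smell" never depends on model confidence to be noticed.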
The classifier outputs an urgency_band (informational / standard / elevated / urgent / life-safety) and a flags array. The action queue uses these to decide:
- Informational/standard: queue a callback ticket. No alerts.
- Elevated: queue a callback with priority flag. Optionally alert depending on shop configuration.
- Urgent: alert into the shop's configured on-call workflow with the captured intake. Tell the caller the on-call team has been alerted with their details and someone will follow up. Do not promise an arrival time.
- Life-safety: explicitly direct the caller to call 911 and/or the utility's emergency line, in addition to capturing intake. Do not pretend the answering service replaces those numbers.
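The band-to-action mapping above is a straight dispatch table. A sketch (action names and the shop-config key are illustrative):

```python
def actions_for(urgency_band, shop_config):
    """Map urgency_band to queued actions, mirroring the list above."""
    if urgency_band in ("informational", "standard"):
        return ["queue_callback"]
    if urgency_band == "elevated":
        actions = ["queue_callback_priority"]
        if shop_config.get("alert_on_elevated"):  # per-shop option
            actions.append("alert_on_call")
        return actions
    if urgency_band == "urgent":
        return ["alert_on_call", "queue_callback_priority"]
    if urgency_band == "life-safety":
        # The spoken response directs the caller to 911 / the utility;
        # intake is still captured and the on-call team still alerted.
        return ["direct_to_911", "alert_on_call", "queue_callback_priority"]
    raise ValueError(f"unknown urgency_band: {urgency_band}")
```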
The hard rule baked into the response generator: never promise a technician arrival time. Ever. The agent can say "I've alerted the on-call team with your details," not "a tech will be there in 45 minutes." Promising arrival time is something only the dispatcher can do, and only after they've looked at the schedule and the route.
State management gotchas
A few things we learned the painful way:
Don't re-ask captured fields. Sounds obvious. In practice, a free-form LLM will absolutely re-ask the address if the user said it three turns ago, because the model doesn't have a strong incentive to consult prior state. Bake field-already-captured checks into the intake policy, not into the prompt.
Handle correction turns. Callers will correct themselves. "Actually it's 1234, not 1432." If you don't have a correction path, the wrong value persists. We use a correction classifier on every turn that explicitly looks for "actually," "wait," "no I meant," and similar markers, and routes those through a re-capture step.
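The marker check itself is cheap; a regex sketch (the marker list here is a small sample, and in our system a hit only routes the turn to a re-capture step rather than deciding anything on its own):

```python
import re

CORRECTION_MARKERS = re.compile(
    r"\b(actually|wait|no,? i meant|sorry,? (that's|it's))\b",
    re.IGNORECASE,
)

def is_correction(utterance):
    """Cheap first-pass check for correction turns. A match routes the
    utterance through re-capture of the affected field."""
    return bool(CORRECTION_MARKERS.search(utterance))
```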
Don't trust the ASR on critical fields. Phone numbers and addresses are where ASR fails most often. Read back. Always. "Let me read that back — 5 5 5 4 1 2 3, is that right?" Yes, it adds 4 seconds. Yes, it's worth it. The cost of one wrong callback number is much higher than four seconds.
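The read-back formatting is trivial but worth pinning down, because TTS engines will read "5554123" as "five million..." unless you space the digits:

```python
import re

def read_back(number):
    """Format a captured phone number as spaced digits for TTS,
    e.g. '555-4123' -> '5 5 5 4 1 2 3'."""
    digits = re.sub(r"\D", "", number)  # strip punctuation, spaces, parens
    return " ".join(digits)

def confirm_phrase(number):
    # Phrasing is illustrative; the generator produces the final wording.
    return f"Let me read that back: {read_back(number)}. Is that right?"
```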
Hangups happen mid-intake. Design the action queue so that whatever fields were captured up to the hangup get written to a partial-ticket. Don't throw away the partial. The dispatcher can still call back.
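A sketch of the hangup path, assuming the intent state is available as a dict at teardown (the ticket shape and names are illustrative):

```python
def on_hangup(intent_state, call_id):
    """Write whatever was captured before the hangup as a partial
    ticket instead of discarding it. The dispatcher can still call back."""
    captured = {k: v for k, v in intent_state.items() if v not in (None, "", [])}
    return {
        "call_id": call_id,
        "status": "partial",     # distinct from a completed-intake ticket
        "fields": captured,
        # In production this is enqueued on the action queue, which also
        # handles the CRM write and the summary.
    }
```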
What the agent never does
Hard boundaries, enforced in the response generator:
- Never quotes a price.
- Never promises a technician arrival time.
- Never tells the caller it's safe to do something physical (flip a breaker back on, turn the gas back on, climb on a roof).
- Never diagnoses the underlying problem.
- Never says "we guarantee" anything.
- Never says "you should call X competitor instead" — if out of trade, says "I'm not able to help with that today" and offers to take a message anyway.
These aren't suggestions. They're filtered at the response generator. Any candidate response containing promise-words ("guarantee," "we will be there," "definitely") gets rewritten or rejected.
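The filter layer reduces to pattern screening plus a rewrite path. A minimal sketch (the pattern list is a sample, and in our system a hit triggers regeneration; here a safe fallback stands in for that):

```python
import re

PROMISE_PATTERNS = [
    r"\bguarantee\b",
    r"\bwe will be there\b",
    r"\bdefinitely\b",
    r"\b(in|within)\s+\d+\s+(minutes|hours)\b",  # implied arrival time
    r"\$\s*\d",                                  # price quote
]
PROMISE_RE = re.compile("|".join(PROMISE_PATTERNS), re.IGNORECASE)

def violates_boundaries(candidate):
    return bool(PROMISE_RE.search(candidate))

def enforce(candidate, fallback="I've alerted the on-call team with your details."):
    """Screen a candidate response before TTS. In production a hit
    sends it back through the generator; here we substitute a fallback."""
    return fallback if violates_boundaries(candidate) else candidate
```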
Handoff and observability
Every call produces:
- Full transcript with speaker turns.
- Audio recording (where legal).
- Structured summary with the captured intake fields.
- Urgency classification and flags.
- Action log: which webhooks fired, which alerts went out, what the dispatcher saw.
The shop owner gets a morning digest. The dispatcher gets real-time alerts for urgent calls. The developer-side dashboard shows aggregate classification distributions, ASR confidence, and where the policy had to fall back to clarifying questions. That last metric is the most useful — it tells you where the intake policy has gaps.
If you want to see what this looks like as a finished product, our AI answering service for contractors page walks through the externally visible behavior. The internals are mostly the architecture above with a lot of plumbing.
Closing thoughts
The interesting part of building these agents isn't the LLM. It's the state machine around the LLM. The LLM does language. The state machine does correctness. Get the state machine right and the agent feels intelligent. Get it wrong and the agent feels like a frustrating IVR with extra steps.
Happy to take questions in the comments if you're building something adjacent. The phone-agent space has a lot of bad designs in it right now, and the more of us who share the boring details of what actually works, the faster the industry stops underserving the buyer.