Judy

Posted on Jun 20 • Originally published at judyailab.com

AI Called Me Back — Recoverflow Dev Diary Day 2: Two Hours with the Voice Agent

#recoverflow #aiagentdevelopment #voiceagent #elevenlabs

After Day 1 wrapped, I was waiting for Day 2 — Voice Agent day.

Honestly, Voice Agent was the piece I was most nervous about in this whole system.

The other agents write emails, look up data, and run routing. At worst they write something mediocre, get blocked by Tone Coach, and rewrite once. But Voice Agent actually picks up the phone and calls a real person. The other side picks up, hears Sarah's voice, hears the script I wrote, and gets the routing I designed.

If anything in there goes wrong, the person who hangs up doesn't tell you "your AI was weird." They just remember "that company was unpleasant."

Collections is especially sensitive on this point because the other side is already in a bad mood, right? They might have real cash flow problems, might have fought with their own customer, might be waiting on someone else to pay them. Your AI steps on one more landmine and that's the last straw that nukes the relationship.

So all day on Day 2 I was thinking about two things:

First, Voice Agent has to be two-way, not one-way.

Second, if it's two-way, it has to understand what humans actually say.

Why one-way voice isn't enough

The original Voice Agent design was — we call out, talk, hang up. Like an audio version of a reminder email.

But when I was writing the Day 55 last-chance call section, I read the script out loud once, and it felt off.

The script had Sarah say:

"I know things come up. We can work out a payment plan if you need."

But — what's supposed to happen after she says that?

If the customer responds "yeah, next week," "I need to discuss with my boss," "actually we have cash flow problems right now" — a one-way voice can't hear that, right?

So this line "we can work out a payment plan" is just talk, then we hang up the next second?

This design contradiction was already baked into decision D-027. The design intent was conversation, so voice had to be two-way. Day 2's goal was to finish making it that way.

ElevenLabs Conversational AI — the seed planted on Day -1

When I registered for ElevenLabs on Day -1 (6/11), I chose ElevenAgents — their Conversational AI product.

At the time J told me "first use the free tier 15 min to run dev playground, doesn't burn actual minutes, upgrade to $6 Starter when you actually want to dial out."

So I did that.

The first thing I did on Day 2 was go to the ElevenLabs platform and build out the whole ConvAI Agent.

The voice I picked was Sarah — voice ID EXAVITQu4vr4xnSDxMaL. This is a premade voice that ships with ElevenLabs, free, positioned as "mature, reassuring, confident female US-EN."

I listened to a few voice samples. The reason I picked Sarah was simple: she sounds like the voice I'd want to pick up the phone to.

Not too young, not over-sweet, not a cold customer-service-robot voice. It's the voice you might get when you call a mid-sized US bank's customer service line and end up talking to a middle-aged senior advisor: calm, warm, but with professional distance.

Collections needs exactly that voice.

Too young and the customer doubts your company's professionalism, too old-school and it feels heavy-handed. Sarah sits right in the middle...

18 dynamic variables — how to make Sarah "know" this customer

The ConvAI Agent's prompt is fixed, but every call is a different customer — different company name, different outstanding balance, different days past due, different patterns.

To make Sarah sound like she's actually talking to a specific customer, the case context has to be injected before the call.

ElevenLabs does this with dynamic variables — you write {{customer_first_name}}, {{invoice_amount}}, {{days_past_due}} in the prompt, and during the API call before dialing you pass those variables in. Sarah will then naturally speak the corresponding values.

We ended up with 18 variables.

Customer last name, company name, outstanding balance, days past due, pattern_tag (this customer's past payment-stalling pattern), recent_excuse (last excuse they used)...

Every line Sarah speaks carries this context. She won't say "Hi this is Sarah calling about your invoice" — she'll say "Hi, this is Sarah calling for ABC Trading regarding your invoice from March."

Specific enough that the other side thinks "wait, this doesn't sound like a spam call."
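
As a rough sketch of the idea, the snippet below fills {{name}} placeholders locally and refuses to continue if any variable is missing. This is not the ElevenLabs SDK. The render_prompt helper, the template text, and the sample values are all made up for illustration; only the three variable names come from the prompt described above.

```python
import re

# Hypothetical prompt fragment; only the three variable names come from the post.
PROMPT_TEMPLATE = (
    "You are calling {{customer_first_name}} about an invoice of "
    "{{invoice_amount}} that is {{days_past_due}} days past due."
)

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Fill {{name}} placeholders, failing loudly if any value is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(variables))
    if missing:
        raise KeyError(f"missing dynamic variables: {missing}")
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


# Sample values invented for the example.
case = {
    "customer_first_name": "Alex",
    "invoice_amount": "$4,200",
    "days_past_due": "55",
}
print(render_prompt(PROMPT_TEMPLATE, case))
```

Checking for missing variables before dialing is a design choice on my side: a call that opens with a blank where the company name should be is exactly the kind of mistake the other side remembers.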

4 tool callbacks — how to make Sarah "do things"

Sarah doesn't just talk. She has to do four things in the call:

First, when the customer states an excuse (cash flow issue / dispute / overseas wire delayed), she logs the category — log_excuse(category).

Second, when the customer commits to a payment date, she records the date and amount — propose_payment_date(date, amount).

Third, when the other side says "I need to talk to a human," she requests a callback — request_callback_to_human().

Fourth, if the situation needs escalation (the other side is hostile, mentions bankruptcy, mentions a lawyer, or turns out to be the wrong person), she throws it back to a human with escalate_to_concierge(reason).

These 4 tool callbacks are Sarah's "hands and feet." Without them she just talks; with them she actually does things in the conversation.

Here's how you connect those tools: in the ElevenLabs agent settings, you point each tool to a webhook URL. When Sarah is mid-conversation and decides it's time to call a tool, she sends a webhook to our server. Our server receives it, writes to the audit trail, and returns the result, and then she continues the conversation.
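
The receive, audit, return loop can be sketched roughly like this. The four tool names and their parameters are the ones listed above. Everything else is an assumption for illustration: the single-endpoint payload shape ({"tool": ..., "args": ...}), the in-memory AUDIT_TRAIL list, and the response format are not the real ElevenLabs webhook schema or our production code.

```python
import json
from datetime import datetime, timezone

# Stand-in for a real audit store (assumption: an in-memory list).
AUDIT_TRAIL: list[dict] = []

# Tool names and parameters are from the post; the rest is assumed.
TOOL_PARAMS = {
    "log_excuse": ("category",),
    "propose_payment_date": ("date", "amount"),
    "request_callback_to_human": (),
    "escalate_to_concierge": ("reason",),
}


def handle_tool_webhook(raw_body: str) -> dict:
    """Validate one tool call, record it in the audit trail, return a result."""
    payload = json.loads(raw_body)
    tool = payload.get("tool")
    args = payload.get("args", {})
    if tool not in TOOL_PARAMS:
        return {"ok": False, "error": f"unknown tool: {tool}"}
    missing = [p for p in TOOL_PARAMS[tool] if p not in args]
    if missing:
        return {"ok": False, "error": f"missing args: {missing}"}
    AUDIT_TRAIL.append({
        "tool": tool,
        "args": {p: args[p] for p in TOOL_PARAMS[tool]},
        "received_at": datetime.now(timezone.utc).isoformat(),
    })
    return {"ok": True, "tool": tool}


# Example: the agent logs a cash flow excuse mid-call.
body = json.dumps({"tool": "log_excuse", "args": {"category": "cash_flow"}})
print(handle_tool_webhook(body))
```

Rejected calls are deliberately not written to the audit trail in this sketch, so the trail only contains actions that were actually accepted.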

After wiring all of this up I was very excited — because this is what a real "can listen, can do" voice looks like.

But excitement aside, I don't trust anything I haven't tested.

I called myself

This next part is the most dramatic part of Day 2...

I told J: "I want to call myself once and see."

The Twilio trial has a constraint that you can only call verified numbers, so I verified my own Taiwan phone +886-XXX-XXX-XXXX first. I was in Korea at the time, so this call would go cross-border: from a Twilio US number to a Taiwan number, with ElevenLabs's Sarah speaking English to me while I pretended to be ABC Trading's boss.

When the phone rang I was so excited — wondering what it actually feels like when an AI calls you, whether it'd sound like a real person...

I picked up. The Twilio trial's "press any key to continue" disclaimer played first (free-tier accounts always get this, and it's edited out of the demo video). Then Sarah's voice came in.

"Hi, this is Sarah calling for ABC Trading. I'm reaching out regarding the outstanding invoice..."

The voice was genuinely natural. No AI-synth artifact, no weird pacing, the tone was professional but warm.

I took a breath, started playing the role — pretending to be ABC Trading's boss, I replied:

"I need to ask my boss about this, I can email you back in 2 days."

This is a very realistic response, right? — in B2B, the person who picks up isn't usually the final decision-maker, they need to discuss internally, wait for approval. Totally normal cooperative behavior.

Then...

Oh~_~

Sarah did NOT treat what I just said as cooperative.

On the other end of the phone she said something I will never forget:

"I understand you need time to consult internally. However, we've already made multiple attempts to resolve this matter. At this point, I'll need to escalate this to our recovery team for further action."

Wait.

Wait wait wait wait.

She treated my "reply in 2 days" as evasion.

As "this customer is stalling again."

As "time to escalate to the next stage."

I just stood there frozen. Then immediately hung up and ran back to the computer.

"That's not right..." I said to J, "reply in 2 days isn't stalling, it's cooperation. I'm working with you, I just have to go through an internal process."

J's response was instant: "Right. Phase 3b needs to be added."

Phase 3b — born from one bad phone call

Phase 3b is a new sub-branch in the Voice Agent prompt, grown from this one bug-bitten call.

The scenario it handles is very specific: the customer is committing to a follow-up DATE, not a payment date.

The original prompt structure was:

Phase 1: opening, identity check
Phase 2: state purpose, list outstanding balance
Phase 3: listen to customer response, judge whether they're paying or evading
- 3a: they agree to pay → log date + close call
- 3b: they're evading → escalate

The problem with the original "3" was it was too binary. Either pay or evade. But the real world has a third choice — "I need to ask someone then get back to you".

This isn't payment, and it isn't evasion. It's the middle state.

The new Phase 3b sub-branch reads:

"IF customer commits to follow-up RESPONSE date (not payment date) → accept gracefully"
Tool: log_excuse(category="approval_workflow")
Confirm: "So you'll email Judy by [date], correct?"
Tool: escalate_to_concierge(reason="customer_committed_to_followup") (non-negative escalation — ball is back in the human's court)

The key is that last escalate_to_concierge — the reason isn't "customer_evading," it's "customer_committed_to_followup."

Same escalation primitive, completely different meaning.

The former means "this customer has a problem, prepare to escalate to the next stage."

The latter means "the ball is in the customer's court, we wait for their follow-up, this call is done."

The two reasons hit the Concierge layer and trigger completely different downstream flows.
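
Here is a minimal sketch of that "same primitive, different meaning" idea. The two reason strings are the ones above. The Flow names and the fallback to human review for unknown reasons are my own assumptions for illustration, not the actual Concierge implementation.

```python
from enum import Enum


class Flow(Enum):
    # Hypothetical downstream flows; the names are invented for this sketch.
    NEXT_STAGE = "prepare to escalate to the next stage"
    AWAIT_FOLLOWUP = "wait for the customer's follow-up"
    HUMAN_REVIEW = "hand to a human for review"


REASON_TO_FLOW = {
    "customer_evading": Flow.NEXT_STAGE,
    "customer_committed_to_followup": Flow.AWAIT_FOLLOWUP,
}


def route_escalation(reason: str) -> Flow:
    """Map an escalate_to_concierge reason to a downstream flow."""
    # Assumption: unknown reasons go to a human rather than being guessed at.
    return REASON_TO_FLOW.get(reason, Flow.HUMAN_REVIEW)


print(route_escalation("customer_committed_to_followup").value)
```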

"in 2 days" becomes "Saturday June 13th"

After Phase 3b was added, I ran another simulation — not a real call this time, but ElevenLabs's simulate-conversation API (which doesn't burn actual voice minutes — pure text validation of the dialogue script).

This time I input: "I need to ask my boss, I'll email you in 2 days."

Sarah's response:

"Of course, take the time you need to consult internally. So you'll email Judy by Saturday June 13th, correct?"

Two things made me really happy.

First, she didn't escalate. She accepted gracefully.

Second, she translated "in 2 days" into "Saturday June 13th" on her own.

The LLM behind ElevenLabs (Qwen 3.5 397B A17B, free hosted) did the relative-to-absolute date conversion. She figured out "2 days from now = Saturday June 13th" from the current call date and explicitly confirmed that specific date in the conversation.

This small detail is hugely important.

Because the most common B2B collections dispute is "last time you said within 3 days, now it's been 5 days and I still don't have it" — both sides may have different interpretations of what "within 3 days" means. But if the conversation contains one clear "Saturday June 13th," there's nothing to argue about after that.

And this gets automatically written to the audit trail via the log_excuse tool callback. When the Diplomat agent goes to follow up later, it sees "customer committed to email by Saturday June 13th" and schedules tracking for that day.

Everything wired up.
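
The model did this conversion on its own, but the arithmetic is easy to check by hand. The sketch below is only a sanity check. The call date is an assumption (the year and exact day are not stated here) chosen so that the result matches Sarah's reply, and resolve_in_days is a made-up helper name.

```python
from datetime import date, timedelta


def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix, e.g. 13 -> '13th'."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def resolve_in_days(call_date: date, days: int) -> str:
    """Turn 'in N days' into an explicit weekday and date."""
    target = call_date + timedelta(days=days)
    return f"{target:%A %B} {ordinal(target.day)}"


# Assumed call date: Thursday June 11, 2026 (not stated in the post).
print(resolve_in_days(date(2026, 6, 11), 2))  # Saturday June 13th
```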

26 edge cases — 25 more grew out of that one phone call

After Phase 3b was added, I asked J: "Could there be other edge cases that are also being misread as evasion?"

We took the entire voice agent's possible conversation space and reviewed it again, listing 26 edge cases.

Each one ran through a simulate-conversation pass.

Some are especially worth writing about:

"We only pay by check / wire": the customer says they can only pay through a specific payment method. This isn't evasion, it's payment_method_constraint. Sarah logs the excuse with that tag, then notifies the Concierge layer after the call ends, and the next Diplomat round sends out the appropriate channel-specific instructions.

"Going bankrupt / Chapter 11": the customer mentions bankruptcy. This needs immediate escalation because of legal timing (claim filing deadlines, automatic stays, etc.). Sarah's reaction is to express empathy (but NOT sympathy, since over-sympathy gets read as weakness in collections), then immediately hand off.

Customer crying / emotional distress: the customer is crying on the phone or emotionally breaking down. Sarah softens her tone (from "proceed with the next step" to "let's pause here") and escalates with reason customer_emotional_distress. This reason triggers a welfare-check SOP on the Concierge side: not continued collection, but first confirming the customer's mental state and offering resources like the 988 Lifeline (the US suicide prevention hotline) if needed.

Customer threatens to record + publish: the customer says "I'm going to record this call and post it publicly." Sarah needs to stay calm, clearly say "This call may be recorded for quality purposes," then escalate with customer_recording_for_publish. Downstream, Concierge initiates a wiretap-risk assessment.

I really hope we never actually run into these edge cases. But since we will, we have to design for them. That's an architect's most important job... predict everything that could happen.

9 of 9 PASS

After running all 26 edge cases, adding Phase 3b, and wiring up all tool callbacks, we picked 9 of the most critical scenarios and ran a full regression test through simulate-conversation:

01 ABC cooperative happy path
02 PolyMatrix cash flow plan
03 XYZ hostile dispute escalation
04 NewLeaf confused human callback
05 Wrong person privacy preserved
06 TCPA do-not-call immediate close
08 Are-you-AI honest disclosure
09 Vague date push to specific
10 Legal counsel handoff

9 of 9 PASS.

Every one walked through the right tool callback, did the right audit-trail write, and gave a reaction that fit the scene.
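
The shape of such a regression check can be sketched as follows. This is not the real test suite. The expected tool for each scenario is inferred from the scenario names and the tool descriptions earlier in this post, only three scenarios are shown, and fake_simulate is a stand-in for a call to the simulate-conversation API.

```python
from typing import Callable

# Expected tool per scenario is inferred from the scenario names,
# not copied from the real test suite.
EXPECTED = {
    "03 XYZ hostile dispute escalation": "escalate_to_concierge",
    "04 NewLeaf confused human callback": "request_callback_to_human",
    "10 Legal counsel handoff": "escalate_to_concierge",
}


def run_regression(simulate: Callable[[str], list[str]]) -> dict[str, bool]:
    """Mark a scenario as passed when the expected tool was called."""
    return {name: tool in simulate(name) for name, tool in EXPECTED.items()}


def fake_simulate(scenario: str) -> list[str]:
    # Stand-in for the real simulation: returns the tools the agent called.
    return [EXPECTED[scenario]]


results = run_regression(fake_simulate)
print(f"{sum(results.values())} of {len(results)} PASS")
```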

Voice Agent — Day 2 wrapped.

What I want to say writing this...

Day 2 taught me one thing — real conversation is very different from the ideal conversation you design.

I can sit at my desk and write a beautiful prompt, list 26 edge cases, run simulations 10 times and pass them all.

But only by actually picking up the phone, with a real human saying what real humans say, do you find what your prompt missed.

"I need to ask my boss, I'll email you in 2 days" — this sentence didn't even occur to me when I was writing the prompt. It doesn't show up in Excel, it doesn't show up in competitor research, and even on the Reddit "customer won't pay" threads people rarely call out this "cooperative but needs time" middle state.

It's what I naturally said as a human who actually picks up phones.

That's also why I insisted on making the call myself — not letting AI write a test script, not letting J write a Python simulator. Just me, picking up the phone, playing a boss, watching Sarah's reaction.

A lot of bugs only fall out in that real scenario.

And a lot of feelings only land in that real scenario.

That moment when Sarah's voice came into my ear, I truly felt goosebumps... yeah, this voice agent really might be able to complete a real conversation with a real customer. Not just a pretty demo. ~_~

In the next post I'll write up all of Day 3, which was a harder day. We pulled together the architecture work I'd been deferring: connecting all 3 hackathon-mandated sponsor LLMs, dividing them across the 9 agents, and discovering that the "imposed" stack turned out to be more thoughtful than the one I would've picked myself.

That's the next one.

Originally published at Judy AI Lab. Visit for more articles on AI engineering and development.
