DEV Community

Sisofo Andrea

Posted on • Originally published at andreasisofo.it

Building a Voice AI Agent in Italian with ElevenLabs + n8n: Lessons From 200 Live Bookings/Month

I deployed a voice AI agent in 7 Italian restaurants. It handles about 200 bookings a month per restaurant, in native Italian, for €87/month total cost. Here is what worked, what broke, and the exact stack that ships in production.

No marketing fluff. Real numbers from 60 days of live deployment, the latency we hit, the edge cases that broke our agent, and the three scenarios where I now refuse to deploy voice AI even when the client begs.


Why Italian Voice AI Is a Different Beast

If you've shipped an English voice agent, here's what you should expect to break when you port the same architecture to Italian.

Latency tolerance is lower. Italian conversation has shorter pauses between turns. The 1.5s response time that feels "fast" in English starts to feel awkward in Italian above 1.2s. Native speakers begin to repeat themselves or check if you're still there. We had to drop our agent's first-response target from 1.5s to 1.0s before user satisfaction stopped degrading.

Regional accent variance is huge. A 60-year-old Roman saying "addò sta er ristorante" is grammatically Italian but ASR-wise it might as well be Catalan. We tested ElevenLabs ASR on three sample populations (Roman elderly, Neapolitan middle-aged, Northern under-30): WER ranged from 4% (Northern under-30) to 19% (Roman elderly). For restaurants in Rome's historic center, we had to add a clarification fallback at the first failed parse, not the third.
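For readers who haven't computed WER before, here is a minimal sketch of the metric: word-level edit distance between a human reference transcript and the ASR output, divided by the number of reference words. The function name and the sample sentences are illustrative, not taken from the test set described above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
wer = word_error_rate("un tavolo per due stasera", "un tavolo per tre stasera")
print(f"WER: {wer:.0%}")
```

Averaged over a sample of calls per population, this is the kind of number that tells you whether the clarification fallback needs to fire earlier for a given site.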

Cultural patterns matter for prompt design. Italian restaurant calls open with extended pleasantries ("buongiorno, scusi il disturbo, volevo solo sapere se per stasera..."). English-trained prompts that try to short-circuit straight to "what's your reservation?" feel rude. We added a 1-2 turn "social warmup" phase before pushing toward intent collection. Booking completion rate went from 71% to 89%.
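One way to picture the warmup phase is as a tiny phase selector, sketched below. This is an assumption about how such a rule could be expressed in code; in the deployment described here the behavior lives in the agent prompt, and `CallState`, `next_phase` and the phase names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_WARMUP_TURNS = 2  # the warmup described above lasts 1-2 turns


@dataclass
class CallState:
    caller_turns: int = 0         # caller utterances heard so far
    intent: Optional[str] = None  # e.g. "book_table" once recognized


def next_phase(state: CallState) -> str:
    """Pick the conversation phase for the agent's next reply."""
    if state.intent is not None:
        # The caller already said what they want: don't stall on pleasantries.
        return "fulfil_intent"
    if state.caller_turns < MAX_WARMUP_TURNS:
        # Mirror the greeting instead of asking for the reservation right away.
        return "social_warmup"
    return "collect_intent"
```

The point of the design is that the warmup is bounded: it never delays a caller who states the intent in the first sentence.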

English-trained TTS sounds robotic in Italian. This isn't just opinion. Blind test, n=50 Italian native speakers, two voice models: Italian-trained ElevenLabs voice scored 4.4/5 naturalness, English-trained voice doing TTS in Italian scored 2.1/5. The latter was correctly identified as AI 78% of the time within the first 10 seconds.


The Stack (in 4 Components)

After testing Vapi, Retell, Bland and a custom Whisper + GPT-4o + OpenAI TTS pipeline, here's what shipped to production:

1. ElevenLabs Conversational AI for voice synthesis + ASR + intent routing in one product. Native Italian voices ("Bianca", custom-cloned), conversation flow handled in their dashboard. Cost: $0.08/min on the Creator plan. Why I picked this over a custom pipeline: managing Whisper + GPT-4o + TTS as three separate services added 200-400ms of latency and required a state machine I didn't want to maintain.

2. n8n self-hosted on a Hetzner CX22 (€5/month) as the orchestration layer. When ElevenLabs recognizes an intent ("book_table", "ask_menu", "modify_reservation"), it calls an n8n webhook. n8n does the actual DB work (Postgres lookup, availability check, write reservation) and responds to the agent with structured data.

3. PostgreSQL on Supabase (free tier handles all 7 restaurants comfortably) — restaurant menu, tables, opening hours, reservations, customer history. Schema is boring on purpose: 6 tables, 22 columns total.

4. Twilio for Italian VoIP — €1/month per phone number, €0.013/min inbound. Yes, we evaluated Vonage, Plivo and Bandwidth. Twilio's Italian numbers had the best call quality and the only support team that actually answered Italian dial issues within 24h.
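To make the webhook-to-database handoff concrete, here is a rough Python sketch of what the orchestration layer does for a "book_table" intent. Everything in it is an assumption for illustration: the event shape is not the real ElevenLabs webhook payload, the two tables are not the production schema, and SQLite stands in for Postgres so the example runs on its own. In production this logic lives in n8n nodes.

```python
import sqlite3
from typing import Any, Dict, Optional

# Hypothetical two-table schema, far smaller than the real one.
SCHEMA = """
CREATE TABLE restaurant_tables (
    id INTEGER PRIMARY KEY,
    restaurant_id INTEGER NOT NULL,
    seats INTEGER NOT NULL
);
CREATE TABLE reservations (
    id INTEGER PRIMARY KEY,
    table_id INTEGER NOT NULL REFERENCES restaurant_tables(id),
    slot TEXT NOT NULL,            -- ISO timestamp of the seating
    party_size INTEGER NOT NULL,
    customer_name TEXT NOT NULL,
    UNIQUE (table_id, slot)        -- one party per table per slot
);
"""


def find_free_table(conn: sqlite3.Connection, restaurant_id: int,
                    slot: str, party_size: int) -> Optional[int]:
    """Return the smallest free table that fits the party, or None."""
    row = conn.execute(
        """
        SELECT t.id FROM restaurant_tables AS t
        WHERE t.restaurant_id = ? AND t.seats >= ?
          AND NOT EXISTS (
              SELECT 1 FROM reservations AS r
              WHERE r.table_id = t.id AND r.slot = ?
          )
        ORDER BY t.seats, t.id
        LIMIT 1
        """,
        (restaurant_id, party_size, slot),
    ).fetchone()
    return row[0] if row else None


def handle_book_table(conn: sqlite3.Connection, slots: Dict[str, Any]) -> Dict[str, Any]:
    """Availability check plus reservation write for one booking request."""
    table_id = find_free_table(conn, slots["restaurant_id"], slots["slot"], slots["party_size"])
    if table_id is None:
        return {"status": "unavailable"}
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO reservations (table_id, slot, party_size, customer_name) "
            "VALUES (?, ?, ?, ?)",
            (table_id, slots["slot"], slots["party_size"], slots["customer_name"]),
        )
    return {"status": "confirmed", "table_id": table_id}


def handle_webhook(conn: sqlite3.Connection, event: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch on the recognized intent and return structured data for the agent."""
    intent = event.get("intent")
    if intent == "book_table":
        return handle_book_table(conn, event.get("slots", {}))
    # "ask_menu" and "modify_reservation" would get handlers of their own.
    return {"status": "handoff", "reason": f"unhandled intent: {intent}"}
```

Returning a small structured object, rather than free text, lets the voice agent decide how to phrase the confirmation in Italian while the database stays the source of truth.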

Total monthly cost per restaurant: €12.50 (~€5 Hetzner shared across 7 restaurants + €1 Twilio + ~€6.50 ElevenLabs minutes).
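The arithmetic behind that figure, as a quick sketch. It takes the three line items at face value and counts the €5 server once per restaurant, which is the reading under which they sum to €12.50. The second calculation shows what the same items give if the single server is split seven ways. Variable names are mine.

```python
HETZNER_EUR = 5.00             # CX22 server, monthly
TWILIO_NUMBER_EUR = 1.00       # one Italian phone number, monthly
ELEVENLABS_MINUTES_EUR = 6.50  # approximate voice minutes, monthly
RESTAURANTS = 7

# Reading 1: every line item counted in full for each restaurant.
per_restaurant = HETZNER_EUR + TWILIO_NUMBER_EUR + ELEVENLABS_MINUTES_EUR
total_all = per_restaurant * RESTAURANTS

# Reading 2: the one server is divided across all restaurants.
per_restaurant_shared = HETZNER_EUR / RESTAURANTS + TWILIO_NUMBER_EUR + ELEVENLABS_MINUTES_EUR

print(f"per restaurant: €{per_restaurant:.2f}")                        # €12.50
print(f"all {RESTAURANTS} restaurants: €{total_all:.2f}")              # €87.50
print(f"per restaurant, shared server: €{per_restaurant_shared:.2f}")  # €8.21
```

The first reading matches the roughly €87/month total quoted at the top of the post.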

Alternatives I rejected after testing:

  • Vapi: comparable quality, 30% more expensive at our volume
  • Retell: better latency on English, noticeably worse Italian voice quality (synthetic accent)
  • Bland: cost-attractive but conversation quality not yet there for Italian
  • Custom pipeline (Whisper + GPT-4o + OpenAI TTS): 200-400ms latency penalty, broke my "one neck to choke" rule

The Knowledge Base Problem (And How I Solved It)

Restaurant menus change daily. A static system prompt with "today's menu" baked in goes stale in 24h. The naive solution — paste the new menu into the prompt every morning — breaks at scale (7 restaurants, 3 staff who don't want to log into a dashboard).

The workflow that works:

Every day at 06:00:
1. Cron in n8n triggers
2. Pulls menu PDF from each restaurant's shared Google Drive folder
3. Sends PDF to Mistral OCR (free tier, 1GB/month, more than enough for menus)
4. Parses returned text into structured JSON (dishes, prices, allergens)
5. Upserts into Postgres `menu_items` table with `valid_from = today`
6. Marks yesterday's menu `valid_to = today`
7. Pings ElevenLabs webhook to invalidate runtime cache

40 lines of n8n nodes. Staff drop the PDF into Drive when they want, the agent picks it up next morning. Zero manual sync.
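Steps 5 and 6 are the part most worth getting right, so here is a sketch of them in Python. The `menu_items` table name and the `valid_from` / `valid_to` columns come from the workflow above. The remaining columns, the `publish_menu` and `current_menu` helpers, and the use of SQLite in place of Postgres are assumptions made so the example is self-contained.

```python
import sqlite3
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical column set; only valid_from / valid_to are named in the workflow.
SCHEMA = """
CREATE TABLE menu_items (
    restaurant_id INTEGER NOT NULL,
    dish TEXT NOT NULL,
    price_eur REAL NOT NULL,
    allergens TEXT NOT NULL DEFAULT '',
    valid_from TEXT NOT NULL,
    valid_to TEXT,                 -- NULL means "currently on the menu"
    PRIMARY KEY (restaurant_id, dish, valid_from)
);
"""


def publish_menu(conn: sqlite3.Connection, restaurant_id: int,
                 items: List[Dict], today: date) -> None:
    """Close the previous menu and upsert today's items in one transaction."""
    day = today.isoformat()
    with conn:
        # Step 6: anything still open from an earlier day ends today.
        conn.execute(
            "UPDATE menu_items SET valid_to = ? "
            "WHERE restaurant_id = ? AND valid_to IS NULL AND valid_from < ?",
            (day, restaurant_id, day),
        )
        # Step 5: upsert, so re-running the job on the same day is harmless.
        conn.executemany(
            """
            INSERT INTO menu_items (restaurant_id, dish, price_eur, allergens, valid_from)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (restaurant_id, dish, valid_from)
            DO UPDATE SET price_eur = excluded.price_eur, allergens = excluded.allergens
            """,
            [
                (restaurant_id, item["dish"], item["price_eur"],
                 ",".join(item.get("allergens", [])), day)
                for item in items
            ],
        )


def current_menu(conn: sqlite3.Connection, restaurant_id: int) -> List[Tuple[str, float]]:
    """Dishes whose validity window is still open."""
    return conn.execute(
        "SELECT dish, price_eur FROM menu_items "
        "WHERE restaurant_id = ? AND valid_to IS NULL ORDER BY dish",
        (restaurant_id,),
    ).fetchall()
```

Keeping closed rows instead of deleting them means yesterday's menu stays queryable, which helps when a caller disputes a price. The `ON CONFLICT ... DO UPDATE` syntax shown here is also accepted by Postgres.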

The trick that took me 3 iterations to find: don't load the full menu into the prompt context. The agent calls a menu_lookup tool function only when the user asks about food. Keeps context lean (cheaper), keeps responses focused, lets us A/B test menu phrasing without touching the agent prompt.
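A sketch of what such a lookup tool could look like on the server side. The name `menu_lookup` is from the text above; the parameters, the return shape and the sample menu are hypothetical, and a real version would query the database rather than an in-memory list.

```python
from typing import Dict, List, Optional

# Illustrative data only.
SAMPLE_MENU: List[Dict] = [
    {"dish": "Spaghetti alla carbonara", "price_eur": 12.0, "allergens": ["egg", "gluten"]},
    {"dish": "Risotto ai funghi", "price_eur": 14.0, "allergens": ["milk"]},
    {"dish": "Tiramisù", "price_eur": 6.0, "allergens": ["egg", "milk", "gluten"]},
]


def menu_lookup(menu: List[Dict], query: Optional[str] = None,
                exclude_allergen: Optional[str] = None, limit: int = 5) -> List[Dict]:
    """Return a short list of matching dishes for the agent to read out."""
    results = []
    for item in menu:
        if query and query.lower() not in item["dish"].lower():
            continue
        if exclude_allergen and exclude_allergen.lower() in item["allergens"]:
            continue
        # Only the fields the agent needs to speak go back into the context.
        results.append({"dish": item["dish"], "price_eur": item["price_eur"]})
    return results[:limit]


print(menu_lookup(SAMPLE_MENU, exclude_allergen="gluten"))
```

The `limit` cap is what keeps the context lean: the agent gets a handful of dishes relevant to the question instead of the whole menu.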


Real Numbers After 60 Days

Aggregate data across 7 restaurants, May 2026:

  • Booking completion rate (without human handoff): 90%
  • Average first-response time: 4.2 seconds (full conversation initiation, not first token)
  • User AI-detection rate in first 30 seconds: 6% (94% don't realize it's AI initially; blind test n=50)
  • Average call duration: 1m 47s (vs 2m 32s for human staff — same booking, faster collection)
  • Peak load handled: 12 simultaneous calls (Saturday 19:00-21:30)
  • Total monthly bookings handled by the agent across 7 sites: 1,403

The 10% that fails:

  • Large group bookings (>15 people) — agent escalates to "we'll call you back to confirm"
  • Complaints about previous experience — agent transfers to manager
  • Custom requests outside menu ("can you make this dish without gluten?") — explicit fallback to staff

These three categories cover ~95% of all handoffs. I list them in the agent's system prompt as "always escalate" cases. The remaining 5% are genuinely unclassifiable edge cases (drunk callers, kids playing with phones, sales calls from suppliers).
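The same three rules can also be written as a guard in the orchestration layer, as a backstop in case the prompt misses one. This is a sketch under assumptions: the intent labels "complaint" and "custom_request" and the returned action names are hypothetical, and only the 15-person threshold comes from the list above.

```python
from typing import Optional

MAX_PARTY_SIZE = 15  # groups larger than this get a callback


def escalation_reason(intent: str, party_size: Optional[int] = None) -> Optional[str]:
    """Return the handoff action for an always-escalate case, else None."""
    if party_size is not None and party_size > MAX_PARTY_SIZE:
        return "large_group_callback"
    if intent == "complaint":
        return "transfer_to_manager"
    if intent == "custom_request":
        return "fallback_to_staff"
    return None  # the agent may handle the call itself


print(escalation_reason("book_table", party_size=16))
print(escalation_reason("book_table", party_size=4))
```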


When Voice AI Is the Wrong Answer

I get 4-5 inbound requests per week now. I refuse roughly 30% of them. Here's when:

Scenario 1: under 25 calls/week. The ROI math doesn't work. Below that volume, the time the owner spends learning the system, training staff to interpret the dashboard, and dealing with the inevitable first-month edge cases costs more than the time saved. Pay a part-time receptionist.

Scenario 2: elderly-skewed clientele (>70% over 65). Voice AI works fine with seniors who are tech-comfortable, but if the bulk of your callers are 70+ and not tech-comfortable, the conversational friction is real. We had one trattoria in Pescara where 80% of bookings were elderly regulars. Booking completion rate stayed at 62% even after 4 weeks of prompt tuning. We turned the agent off.

Scenario 3: highly consultative calls. Anything where the value is in extended human conversation — legal intake, medical triage, financial advice — should stay human for ethical reasons before practical ones. Voice AI can take a callback request. It shouldn't take a substantive professional consultation.

Saying no to these is how I keep the 90% completion rate on the agents that do ship.


Closing

If you want the full Sofia case study (architecture diagrams, n8n workflow JSON, the actual ElevenLabs prompt template, screenshots of the dashboard), I documented everything publicly at andreasisofo.it/sofia-ristoranti.

I'm Andrea Sisofo, a freelancer (ex-BBDO) based in Rome. I build voice AI agents in Italian for SMBs. Reach me at andreasisofo.it or on LinkedIn.

Open to questions in the comments.
