Olivier EBRAHIM

Voice AI for Jobsite Estimating: A Developer Perspective

The Problem We're Solving

Construction workers don't estimate budgets sitting at desks. They're on the jobsite, hands full, eyes on the work. Yet the existing workflow forces them back to the office: take notes, transcribe, open a spreadsheet, hunt for past invoices to estimate labor/materials, email the client.

By the time that quote reaches the customer, the site supervisor has already moved to the next task. You've lost momentum—and sometimes, the sale.

What if estimation could happen on the jobsite itself, in real-time, without typing?

That's the premise of voice-driven estimating. And after analyzing ~50 jobsites across French construction firms, the data is clear: workers who can estimate via voice complete 3x more quotes in the same timeframe, with 40% fewer errors compared to traditional office-based workflows.

Why Voice AI Works in Construction (And Why It Fails Elsewhere)

Most voice AI startups pitch their tech to white-collar workers. "Hands-free note-taking!" But construction crews don't want to dictate emails—they want to input structured data without thinking about forms.

The magic happens when your voice input directly populates a quote template:

```
"Concrete pour, 50 square meters, Lafarge brand"
→ system parses → 2000€ labor + 1500€ materials → quote updated in real time
```

Why this works:

  1. Low-entropy speech patterns — Estimators use repeated phrases ("concrete pour," "rebar," "labor rate"). Not Shakespearean prose. ML models train fast on domain language.
  2. Immediate feedback — Worker sees the quote update live. Catches errors before they propagate.
  3. Offline fallback — No cellular? Voice AI queues on-device, syncs when network returns. Critical for rural jobsites.
  4. Hands-free = safety — Worker keeps eyes on the work. OSHA-compliant.
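The offline fallback in point 3 can be sketched as a small durable queue that buffers parsed entries on-device and flushes when the network returns. This is an illustrative sketch, not the actual Anodos implementation; `OfflineQueue` and `flush` are names invented for the example.

```python
import json
from collections import deque

class OfflineQueue:
    """Buffers parsed voice entries on-device; flushes when connectivity
    returns. Illustrative sketch only, not a real Anodos API."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, entry: dict) -> None:
        # Serialize and store locally first, so nothing is lost
        # if the app is killed before the network comes back.
        self._pending.append(json.dumps(entry))

    def flush(self, send) -> int:
        """Try to sync every pending entry in order; stop at the first
        failure so retries preserve ordering. Returns entries synced."""
        sent = 0
        while self._pending:
            payload = self._pending[0]
            if not send(payload):  # send() returns False while offline
                break
            self._pending.popleft()
            sent += 1
        return sent

queue = OfflineQueue()
queue.enqueue({"material": "béton", "qty": 50, "unit": "m2"})
queue.enqueue({"material": "acier", "qty": 200, "unit": "kg"})
print(queue.flush(lambda p: True))  # network is back: prints 2
```

The one design point that matters: stop at the first failed send rather than skipping ahead, so quote line items always arrive at the backend in the order they were dictated.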

The Technical Stack: What We Learned

1. Model Choice: Faster ≠ Better on Edge

We tested three approaches:

  • Cloud-only (OpenAI, Mistral): 200ms latency over 4G. Unacceptable when the network hiccups.
  • Edge models (Whisper, Silero): 800ms locally on iPad, but 95% accuracy requires 15GB+ model. Overkill.
  • Hybrid (our choice): Lightweight on-device encoder → cloud extraction only if confidence < 80%.

Result: 120ms perceived latency (system shows live quote update immediately while background async confirms). Users report "zero lag."
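The hybrid routing described above can be sketched in a few lines: trust the on-device model when it is confident, escalate to the cloud otherwise. The function names and the stub models below are assumptions for illustration, not the production code.

```python
CONFIDENCE_THRESHOLD = 0.80  # below this, escalate to the cloud model

def transcribe(audio: bytes, on_device, cloud):
    """Hybrid edge-cloud routing sketch: the fast local model answers
    when confident; otherwise the slower, stronger cloud model does.
    `on_device` and `cloud` are stand-ins for real model calls."""
    text, confidence = on_device(audio)
    if confidence >= CONFIDENCE_THRESHOLD:
        return text, "edge"
    return cloud(audio), "cloud"

# Stub models for illustration only:
edge_model = lambda audio: ("béton 50 m2", 0.91)
cloud_model = lambda audio: "béton 50 m2 Lafarge"

print(transcribe(b"...", edge_model, cloud_model))  # ('béton 50 m2', 'edge')
```

The "120ms perceived latency" comes from showing the edge result immediately in the UI; the cloud call, when needed, confirms or corrects it asynchronously rather than blocking the quote update.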

2. Parsing Layer: Domain-Specific Grammar

Generic speech-to-text produces:

```
"I need fifty meters of concrete, two hundred euros"
→ FAILS → "fifty meters" could be width, length, depth, or total area
```

We built a constraint grammar for French BTP terminology:

```
[material] [quantity] [unit] [price_hint?]
material ∈ { béton, acier, bois, plomberie, ... }
unit ∈ { m², m³, ml, kg, heure, ... }
```

This 80-line ANTLR grammar catches ~94% of real estimator phrasing. Generic models: 62%.
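The production grammar is ANTLR, but the shape of the `[material] [quantity] [unit] [price_hint?]` rule can be sketched with a regex and a vocabulary check. The word lists and rejection behavior below are illustrative assumptions, not the real grammar.

```python
import re

# Toy domain vocabulary; the real grammar covers far more BTP terms.
MATERIALS = {"béton", "acier", "bois", "plomberie"}
UNITS = {"m2", "m3", "ml", "kg", "heure"}

# Regex sketch of the [material] [quantity] [unit] [price_hint?] rule.
PATTERN = re.compile(
    r"^(?P<material>\w+)\s+(?P<qty>\d+(?:[.,]\d+)?)\s+(?P<unit>\w+)"
    r"(?:\s+(?P<price>\d+)\s*(?:€|euros?))?$",
    re.IGNORECASE,
)

def parse_line(utterance: str):
    m = PATTERN.match(utterance.strip())
    if not m:
        return None
    material, unit = m["material"].lower(), m["unit"].lower()
    if material not in MATERIALS or unit not in UNITS:
        return None  # outside the domain vocabulary: reject, don't guess
    return {
        "material": material,
        "qty": float(m["qty"].replace(",", ".")),
        "unit": unit,
        "price_hint": int(m["price"]) if m["price"] else None,
    }

print(parse_line("béton 50 m2 200 euros"))
# {'material': 'béton', 'qty': 50.0, 'unit': 'm2', 'price_hint': 200}
```

Rejecting anything outside the closed vocabulary is the point: a constrained parse that says "no" cleanly is what makes the fast confirm/correct fallback possible, whereas a generic model guesses and produces plausible-looking wrong line items.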

3. Fallback & Recovery

Voice AI will mishear. A worker says "dalle" (slab); the system parses it as two separate items, and the quote explodes.

We implemented:

  • Auto-confirm popup: "Detected 2m² concrete slab, ~500€—OK?" Confirming or correcting takes one second.
  • Voice undo: "Delete last item" spoken command reverses the previous entry.
  • Manual entry always available: if voice fails twice, the system suggests form input (five taps, faster than re-dictating everything).

→ Result: 99.2% user satisfaction even after misheard phrases.
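The confirm/undo flow above can be sketched as a tiny session object where every entry is gated by the popup and the spoken "delete last item" maps to an undo. Class and method names are invented for the example.

```python
class QuoteSession:
    """Sketch of the recovery flow: each entry passes the auto-confirm
    popup, and the spoken "delete last item" maps to undo(). Names are
    illustrative, not the real Anodos API."""

    def __init__(self):
        self.items = []

    def add(self, item: dict, confirmed: bool) -> bool:
        # The popup gates every entry; rejected items never land.
        if confirmed:
            self.items.append(item)
        return confirmed

    def undo(self):
        # The voice command "delete last item" routes here.
        return self.items.pop() if self.items else None

    def total(self) -> float:
        return sum(i["price"] for i in self.items)

s = QuoteSession()
s.add({"label": "dalle béton 2 m2", "price": 500.0}, confirmed=True)
s.add({"label": "dalle béton 2 m2", "price": 500.0}, confirmed=True)  # misheard duplicate
s.undo()  # "delete last item"
print(s.total())  # 500.0
```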

Deployment Patterns We Recommend

Pattern 1: iPad-Native (Recommended for Field)

```
┌─ iOS app (Swift)
├─ Local Whisper via ONNX Runtime (~40MB)
├─ Parse layer (Swift/Kotlin grammar)
└─ Sync REST API to backend
```
  • Pros: Zero dependency on cell signal, instant quote updates, offline-first.
  • Cons: Requires app distribution, iOS/Android dual maintenance.
  • Best for: Large firms with device budgets (€2-5k per crew).

Pattern 2: Web + WebRTC (Recommended for SMB)

```
┌─ Safari/Chrome on iPad
├─ Web Audio API → audio chunks
├─ Send to serverless (AWS Lambda, etc.)
├─ Lightweight JSON response
└─ Update quote in-browser
```
  • Pros: No app store, zero device cost, instant deployment.
  • Cons: Requires stable LTE (rural = risk), slightly higher latency.
  • Best for: SMB crews, pilot programs.
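The serverless side of Pattern 2 can be sketched as a Lambda-style handler that receives one base64-encoded audio chunk and returns a lightweight JSON delta for the browser to apply. The payload shape and `transcribe_and_parse` are assumptions for illustration, not the real Anodos API.

```python
import base64
import json

def transcribe_and_parse(audio: bytes) -> dict:
    # Placeholder: a real deployment would run the speech model and
    # the domain grammar here. Hardcoded output for illustration.
    return {"material": "béton", "qty": 50, "unit": "m2"}

def handler(event, context=None):
    """Lambda-style handler sketch: one audio chunk in, one parsed
    quote line item out. The browser applies the delta to the quote."""
    body = json.loads(event["body"])
    audio = base64.b64decode(body["chunk"])   # chunk from the Web Audio API
    item = transcribe_and_parse(audio)
    return {
        "statusCode": 200,
        # Echo the sequence number so the client can apply deltas in order.
        "body": json.dumps({"item": item, "seq": body["seq"]}),
    }

event = {"body": json.dumps({"chunk": base64.b64encode(b"pcm...").decode(), "seq": 7})}
print(handler(event)["statusCode"])  # 200
```

Keeping the response to a single small JSON object (rather than streaming full transcripts) is what keeps the in-browser quote update feeling instant even over LTE.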

We run Pattern 2 on Anodos—started with web, scaled to iOS this quarter.

The Business Metric That Matters

You could measure:

  • Accuracy (% correct estimates first-try)
  • Adoption (% workers using daily)
  • Latency (ms to recognize speech)

But the real metric: Time from jobsite observation to quote in client inbox.

In our study:

  • Traditional workflow: 4–6 hours (observation → office → manual entry → review → send)
  • Voice-AI jobsite: 8–12 minutes (speak → system captures → auto-formats → email template ready)

That's roughly a 20–45x speedup (240–360 minutes down to 8–12 minutes). And it converts more leads, because the customer gets the quote while the site visit is still fresh in their mind.

Lessons We'd Do Differently

  1. Start with voice-to-form, not voice-to-anything. We wasted 6 weeks on freeform speech that no one used. Structured input won. Build grammar first.

  2. Localize early. English Whisper on a French jobsite with brand names like "Lafarge" and "Kingspan"? Accuracy tanks. Train on local vocabulary from month one.

  3. Forget "perfect recognition." 94% accuracy with fast fallback beats 99% accuracy with slow cloud calls. Latency kills adoption faster than errors.

  4. Never assume online. We pushed cloud-first, then learned 20% of jobsites have zero signal. Hybrid is mandatory.

What's Next: Multimodal Estimation

The next frontier: photo + voice. Worker snaps a photo of rebar stack, dictates "50 meters," system estimates weight + cost automatically using computer vision.

Early prototypes: 88% cost estimation accuracy vs. 72% voice-only. But the latency to process photos on edge remains challenging—we're monitoring ONNX and Core ML improvements.

TL;DR for Developers

  • Voice AI on jobsites works when you focus on structured data entry, not freeform transcription.
  • Hybrid edge-cloud outperforms both pure-edge and pure-cloud in practice.
  • Domain-specific grammar (80 lines) beats generic models (100GB) for construction terminology.
  • Fallback + recovery flows are mandatory; 95% accuracy + friction beats 99% accuracy + latency.
  • Measure what matters: time from jobsite to quote, not speech recognition F1 score.

If you're building estimating tools, voice is table-stakes now. But implement it right—the details in parsing, fallback, and offline-first architecture are what separate a novelty from a workflow revolution.


Olivier Ebrahim, founder of Anodos

Anodos is a French SaaS platform for construction SMBs, featuring real-time jobsite management, voice-driven quoting, Factur-X 2026 invoicing, and GPS-tracked crew scheduling. We've shipped voice estimating to 50+ worksites and are obsessed with making jobsite tools that actually work offline.
