Voice AI for jobsite estimating: a developer perspective

#construction #ai #saas #webdev

The Problem Nobody Talks About

Construction estimating is stuck in a 1990s workflow. A foreman stands on a muddy site, squinting at blueprints, scribbling measurements into a notebook. Back at the office, an admin re-enters those notes into a spreadsheet. Worse, the contractor ends up calling the foreman three times to ask, "Did you write down the ceiling height?"

This friction costs real money. A 50-person masonry firm spends 80 labor hours per month on manual estimate data entry. That's roughly two full workweeks of a junior estimator's time, gone every month. The error rate? Studies show 6-12% of estimates contain pricing mistakes because of transcription errors or misunderstood notes.

In 2024-2025, we have GPT-4 Turbo, the Whisper API, and local voice models that run on a smartphone. Yet construction estimating is still analog. Why? Because most voice-to-CRM solutions assume office work: they're built for sales calls, not chaotic jobsites with machinery noise, concrete dust, and workers yelling in the background.

This post walks through how we solved it at the engineering level.

Architecture: Voice → Transcription → Structured Data

The naive approach: record audio, send to Whisper, extract raw text, insert into database. In practice, that works for maybe 70% of jobsite audio. The remaining 30% is noise, accents, jargon, and overlapping voices that confuse any SOTA model.

Our stack:

1. Audio preprocessing (client-side, TypeScript)
- Real-time noise suppression using the SpeexDSP preprocessor (similar in role to the WebRTC Audio Processing Module)
- Auto-gain control to normalize volume spikes (jackhammers, nail guns)
- 16 kHz mono sampling (Whisper's sweet spot)
- Streaming frame buffering (1-second chunks), with per-chunk processing kept under 200 ms

2. Transcription (Whisper API + local fallback)
- Primary: OpenAI Whisper API with model=whisper-1, language=fr
- Fallback: Faster-Whisper (local, CTranslate2 backend) for offline resilience
- Temperature set to 0.2 (lower = more consistent on technical terms like "Factur-X")
- Custom vocabulary injection for construction jargon (e.g., "hourdis", "chevron", "chaînage"), which with the Whisper API means seeding the prompt parameter

3. NLP post-processing (Python + spaCy)
- Named entity recognition for material types, dimensions, unit conversions
- Dependency parsing to extract "quantity + material + price" relationships
- Fuzzy matching against a domain glossary (100+ construction terms in French/English)

4. Validation layer (business logic)
- Price sanity checks (flag estimates >20% outside historical range)
- Unit consistency (e.g., "500 square meters" with "2 painters" → check that the implied labor rate is plausible)
- Missing critical fields (e.g., location, deadline, client) → trigger user prompt before save
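
To make the validation layer concrete, here is a minimal sketch of two of its checks: the missing-field prompt and the 20% price sanity check. The names `HistoricalRange` and `validate_estimate` are hypothetical, the dict keys are assumptions, and the unit-consistency check is left out. Treat it as an illustration, not the production logic.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("location", "deadline", "client")  # critical fields named in the stack
PRICE_TOLERANCE = 0.20  # flag totals more than 20% outside the historical range


@dataclass
class HistoricalRange:
    """Hypothetical container for the price range seen on similar past jobs."""
    low_eur: float
    high_eur: float


def validate_estimate(estimate: dict, history: HistoricalRange) -> list[str]:
    """Return human-readable issues; an empty list means the estimate can be saved."""
    issues = []

    # Missing critical fields should trigger a user prompt before save.
    for field in REQUIRED_FIELDS:
        if not estimate.get(field):
            issues.append(f"missing field: {field}")

    # Price sanity check with 20% slack on either side of the historical range.
    total = estimate.get("total_eur")
    if total is not None:
        floor = history.low_eur * (1 - PRICE_TOLERANCE)
        ceiling = history.high_eur * (1 + PRICE_TOLERANCE)
        if not floor <= total <= ceiling:
            issues.append(f"total {total} EUR outside expected {floor:.0f}-{ceiling:.0f} EUR")

    return issues
```

Returning a list of issues rather than raising on the first one lets the form show every problem in a single prompt, which matters when the foreman has only a few seconds to respond.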

Real-World Example: A Masonry Estimate in 45 Seconds

Foreman on site, iPad in hand:

"Okay so—wall repair, east side, fifty-four square meters. Crack injection, about quarter-inch wide, mortar repoint on— [nail gun fires] —on the perimeter. Two days labor, maybe three. Uh, supply cost around eight hundred euros. Standard rate please."

Raw Whisper output:

wall repair east side fifty four square meters 
crack injection about quarter inch wide mortar 
repoint on [NOISE] on the perimeter two days 
labor maybe three supply cost around eight 
hundred euros standard rate please

Post-processing pipeline:

Quantity extraction: "54 m²" (entity recognition: MEASUREMENT)
Work type: "crack injection + mortar repoint" → mapped to cost code BM-2301
Labor: "2-3 days" → normalized to 2.5 days at standard rate
Supply: "€800" → approved (within 15% of historical range for this scope)
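
The extraction steps above can be approximated with plain rules. The sketch below is not the spaCy pipeline described earlier: it is a standard-library stand-in that handles English number words below 1000 and the exact phrasing of this transcript. All function names are hypothetical.

```python
import re
from typing import Optional

UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
NUMBER_WORDS = set(UNITS) | set(TENS) | {"hundred"}


def words_to_int(words: list[str]) -> int:
    """Convert spoken number words below 1000, e.g. ['fifty', 'four'] -> 54."""
    total = 0
    for word in words:
        if word == "hundred":
            total = max(total, 1) * 100
        elif word in UNITS:
            total += UNITS[word]
        else:
            total += TENS[word]  # raises KeyError for anything that is not a number word
    return total


def number_before(transcript: str, unit: str) -> Optional[int]:
    """Find the run of number words directly in front of a unit phrase."""
    tokens = transcript.lower().split()
    unit_tokens = unit.split()
    n = len(unit_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == unit_tokens:
            j = i
            while j > 0 and tokens[j - 1] in NUMBER_WORDS:
                j -= 1
            if j < i:
                return words_to_int(tokens[j:i])
    return None


def labor_days(transcript: str) -> Optional[float]:
    """Average a spoken range such as 'two days labor maybe three' to 2.5."""
    match = re.search(r"(\w+) days labor(?: maybe (\w+))?", transcript.lower())
    if not match or match.group(1) not in NUMBER_WORDS:
        return None
    low = words_to_int([match.group(1)])
    maybe = match.group(2)
    high = words_to_int([maybe]) if maybe in NUMBER_WORDS else low
    return (low + high) / 2


TRANSCRIPT = (
    "wall repair east side fifty four square meters crack injection about "
    "quarter inch wide mortar repoint on [NOISE] on the perimeter two days "
    "labor maybe three supply cost around eight hundred euros standard rate please"
)

print(number_before(TRANSCRIPT, "square meters"))  # 54
print(labor_days(TRANSCRIPT))                      # 2.5
print(number_before(TRANSCRIPT, "euros"))          # 800
```

Rules like these break as soon as the phrasing drifts, which is the argument for the entity recognition and dependency parsing in the real pipeline.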

Structured output (JSON):

{
  "scope": "Crack injection and mortar repoint",
  "area_sqm": 54,
  "labor_days": 2.5,
  "labor_rate_eur": 450,
  "supply_eur": 800,
  "total_eur": 1925,
  "confidence": 0.92,
  "flagged_for_review": false
}
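
One way to keep a payload like this internally consistent is to derive the total instead of storing it. The sketch below assumes the total is simply labor days times a daily rate plus supplies. The post does not spell out the formula, so margin, tax, and the per-day reading of `labor_rate_eur` are all assumptions, and the `Estimate` class is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Estimate:
    scope: str
    area_sqm: float
    labor_days: float
    labor_rate_eur: float  # assumed to be a per-day rate
    supply_eur: float
    confidence: float
    flagged_for_review: bool = False

    @property
    def total_eur(self) -> float:
        # Assumption: labor days x daily rate + supplies, with no margin or tax.
        return self.labor_days * self.labor_rate_eur + self.supply_eur

    def to_json(self) -> str:
        payload = asdict(self)                # dataclass fields only
        payload["total_eur"] = self.total_eur  # derived, so it cannot drift
        return json.dumps(payload, indent=2, ensure_ascii=False)


estimate = Estimate(
    scope="Crack injection and mortar repoint",
    area_sqm=54,
    labor_days=2.5,
    labor_rate_eur=450,
    supply_eur=800,
    confidence=0.92,
)
print(estimate.to_json())
```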

Time to estimate: 45 seconds. Previously: 15 minutes (site + re-entry) + 1 callback to clarify.

Deployment Challenges & Lessons

1. Noise resilience is non-negotiable

We spent 200+ hours on preprocessing. Speex alone handles 60% of construction noise. The remaining 40% required a custom denoising model trained on 500 hours of actual jobsite audio (concrete saws, generators, traffic). Off-the-shelf voice models trained on podcasts fail hard on this audio distribution.

2. Language models are terrible at French construction terminology

GPT-4 Turbo, left alone, will confidently misquote a foreman. "Hourdis" (a hollow clay block common in France) gets transcribed as "who-dis" or as something else entirely. We trained a small LoRA adapter (15 epochs, 2K examples) on construction French. Cost: ~$200. ROI: ~$15K/month in error reduction.
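
Independently of fine-tuning, the glossary fuzzy matching listed in the stack can catch the simplest misses. Here is a minimal sketch using `difflib` from the Python standard library. The five-term glossary, the 0.7 cutoff, and the `correct_term` helper are illustrative assumptions, not the production matcher.

```python
import difflib

# Illustrative subset; the post mentions a glossary of 100+ terms.
GLOSSARY = ["hourdis", "chevron", "chaînage", "linteau", "parpaing"]


def correct_term(token: str, cutoff: float = 0.7) -> str:
    """Snap a transcribed token to the closest glossary term, if one is close enough."""
    matches = difflib.get_close_matches(token.lower(), GLOSSARY, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_term("who-dis"))  # hourdis
print(correct_term("mortar"))   # mortar (no glossary term is close enough)
```

The cutoff is the knob to watch: too low and ordinary words get rewritten into jargon, too high and real mishearings slip through.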

3. Latency beats accuracy for short utterances (~100 tokens)

Early versions tried to be perfect—waiting for complex reasoning to extract all structured fields. In practice, a foreman will talk for 30-45 seconds, then expect a response. We now validate in real-time, show confidence scores, and ask clarifying questions only when strictly needed. Latency target: <3 seconds end-to-end (audio capture → JSON in estimating form).
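
The "ask only when strictly needed" rule can be sketched as a gate on per-field confidence. The 0.85 threshold, the field names, and the `next_step` function are all hypothetical. The idea is to ask about at most one field, the weakest, so the foreman is not interrogated.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.85  # hypothetical cut-off; tune it against real approval data


def next_step(
    field_confidence: dict[str, float],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> tuple[str, Optional[str]]:
    """Choose between showing the form for approval and asking one question."""
    if not field_confidence:
        return ("clarify", None)  # nothing was extracted, so ask the user to repeat
    weakest = min(field_confidence, key=field_confidence.get)
    if field_confidence[weakest] < threshold:
        return ("clarify", weakest)
    return ("approve", None)


print(next_step({"area_sqm": 0.97, "labor_days": 0.62, "supply_eur": 0.80}))
# ('clarify', 'labor_days')
```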

4. Offline fallback is critical

Construction sites lose signal. We ship a local Faster-Whisper model (~500 MB, quantized) that runs on the iPad. The first transcription attempt goes to the cloud (better model, lower latency when online). If it has not returned after 2 s, we fall back to the local model and sync on the next WiFi connection. This handles 99% of connectivity gaps.
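
The cloud-first, local-second pattern is easy to express with a timeout. The sketch below uses `asyncio.wait_for`; the `cloud` and `local` callables are placeholders for whatever clients the app actually uses, and the two stand-in coroutines exist only to simulate a dead connection. The later sync step is not shown.

```python
import asyncio

CLOUD_TIMEOUT_S = 2.0  # give the cloud two seconds, as described above


async def transcribe_with_fallback(audio, cloud, local, timeout=CLOUD_TIMEOUT_S):
    """Try the cloud transcriber first; fall back to the on-device model on timeout."""
    try:
        text = await asyncio.wait_for(cloud(audio), timeout=timeout)
        return text, "cloud"
    except (asyncio.TimeoutError, OSError):
        # Timed out or the network failed: use the local model instead.
        return await local(audio), "local"


# Stand-ins for real clients, used only to exercise the fallback path.
async def slow_cloud(audio):
    await asyncio.sleep(5)  # simulates a site with no signal
    return "cloud transcript"


async def fast_local(audio):
    return "local transcript"


result = asyncio.run(
    transcribe_with_fallback(b"\x00\x00", slow_cloud, fast_local, timeout=0.05)
)
print(result)  # ('local transcript', 'local')
```

Returning the source alongside the text makes it possible to mark locally transcribed estimates for a second pass once the device is back online.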

Integration with Your SaaS Stack

Anodos (a French construction SaaS platform) integrated this workflow end-to-end:

Voice input → iOS/Android native recorder
Preprocessing and transcription → on-device (Speex) + cloud (OpenAI Whisper)
NLP validation → Python microservice
Output → auto-populate an estimate form, show confidence score, require 1-click approval

Result: 50-person firms reduced estimate time by 65% on average, and estimate error rates dropped to <2% (vs. 8% manual baseline).

The key insight: don't try to be AI-first. Be human-first—show the AI's output clearly, let the human decide in <5 seconds whether to approve or edit. This turns a "spooky black box" into a useful tool.

Open Questions

Accent robustness: Whisper handles US/UK English well. French regional accents (Belgian Walloon, Swiss French) still confuse it. Worth exploring?
Real-time feedback: What UX patterns work best for showing confidence scores to non-technical foremen?
Multi-speaker scenarios: Two foremen discussing a job. Can we attribute labor estimates to individuals? Early prototype is ~70% accurate.

Takeaway

Voice UI for construction isn't just a usability gain—it's a data integrity engine. When your bottleneck is manual re-entry and human transcription error, automating the voice-to-form layer unlocks immediate ROI.

If you're building for any non-office industry (trades, field service, logistics), voice + local inference + human validation is a pattern worth exploring.

Olivier Ebrahim, founder of Anodos, a French SaaS for construction SMEs (PMEs). Built the voice estimating system above, in production on 50+ jobsites. Always learning, always breaking something.