Voice AI for Jobsite Estimating: A Developer Perspective
The Problem Nobody Talks About
Construction estimators spend 4–6 hours per week typing quotes into spreadsheets or PDF forms. They're outdoors, on ladders, in dust, squinting at sun-bright screens—and still they hunt-and-peck single-handed because the other hand holds a measuring tape or clipboard.
In 2024, we shipped voice-to-estimate at Anodos, a construction SaaS for French SMBs. What we learned is that voice AI isn't a feature—it's a UX game-changer when built right. This post breaks down the technical and operational realities of integrating voice estimating into a web app designed for jobsites.
1. Choosing the Right Speech-to-Text Engine
The naive approach: Plug in Google Cloud Speech-to-Text, send all audio upstream, done.
The jobsite reality: Construction sites have patchy networks. Your WiFi router is 200m away. Cellular is 1–2 bars. You need an offline fallback and latency under 1 second for voice input to feel natural to the user.
We evaluated:
- Google Cloud Speech-to-Text (v2): 95% accuracy on English, ~600ms round-trip. Excellent for demos, but requests failed on site about 30% of the time because of network hiccups. Cost: $0.024 per 15 seconds.
- Deepgram: Slightly lower accuracy (92%), but ~200ms latency and better handling of heavy accents (French/Occitan). Tiered pricing ($0.0043–0.009/min). Our pick for production.
- Whisper (OpenAI): Runs locally on device. Accuracy 88–92%, but inference is expensive: the model is 2–3GB and drains the battery. Fine for a tablet, not for low-end Android phones.
Our choice: Hybrid. Primary stream to Deepgram (low-latency cloud), with fallback to on-device Whisper if the network fails. Estimates sync up when the connection returns.
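The hybrid routing can be sketched in a few lines. This is a minimal illustration, not the production code: `cloud` and `local` are hypothetical adapter functions (any async function that takes audio and resolves to a transcript string), and the 1-second budget is taken from the latency target above.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `cloud` and `local` are hypothetical adapters: async (audio) => string.
async function transcribeWithFallback(audio, { cloud, local, timeoutMs = 1000 }) {
  try {
    const text = await withTimeout(cloud(audio), timeoutMs);
    return { text, source: 'cloud', needsSync: false };
  } catch (err) {
    // Network error or slow response: use the on-device model instead
    // and flag the result so it can be synced once the connection returns.
    const text = await local(audio);
    return { text, source: 'local', needsSync: true };
  }
}
```

The `needsSync` flag is one possible way to mark estimates that were produced offline; the caller decides when to upload them.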
2. Domain-Specific Language & Accuracy Tricks
Generic speech engines hear "pour toi" (for you) instead of "P.T." (short for "price adjustment"). They miss "lintel 2x10" because lumber specs aren't in most training sets.
What we do:
Custom vocabulary file: Upload a JSON list of construction terms (lintel, soffit, fascia, Factur-X, etc.) to Deepgram's API with model=nova-2-general plus the custom vocabulary. Boosts domain term recognition by ~15%.
Post-processing regex: After the transcript returns, run a two-pass filter:
- Pass 1: Match known unit patterns ("12 foot by 8", "3m x 2m", "qty 5") and standardize them.
- Pass 2: Fuzzy-match against your item database, so that a near-miss such as "liner" is corrected to the closest catalog term for the current trade context.
User corrections loop: Before finalizing, display the transcript as an editable draft. Users can tap a word to correct it. Store corrections in a per-user "correction dictionary" and feed it back as bias on the next call.
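As a rough sketch of the correction dictionary idea, the class below stores each user's tapped corrections in memory, replays them on later transcripts, and exposes the corrected terms as a bias list. The class name, method names, and in-memory storage are assumptions for illustration; a real app would persist this per user.

```javascript
// Hypothetical in-memory store: userId -> Map(heard word -> corrected word).
class CorrectionDictionary {
  constructor() {
    this.byUser = new Map();
  }

  // Remember that this user replaced `heard` with `corrected`.
  record(userId, heard, corrected) {
    if (!this.byUser.has(userId)) this.byUser.set(userId, new Map());
    this.byUser.get(userId).set(heard.toLowerCase(), corrected);
  }

  // Replace previously corrected words, leaving punctuation untouched.
  apply(userId, transcript) {
    const fixes = this.byUser.get(userId);
    if (!fixes) return transcript;
    return transcript.replace(/[\p{L}\p{N}'-]+/gu, (word) =>
      fixes.get(word.toLowerCase()) ?? word
    );
  }

  // Unique corrected terms, usable as vocabulary bias on the next call.
  biasTerms(userId) {
    const fixes = this.byUser.get(userId);
    return fixes ? [...new Set(fixes.values())] : [];
  }
}
```

For example, after `dict.record('u1', 'lentil', 'lintel')`, calling `dict.apply('u1', 'Two lentil, windows')` rewrites the misheard word while keeping the comma in place.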
Example workflow:
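// `deepgram`, `applyConstructionRegex`, and `fuzzyMatchAgainstItemDB` are
// app-level helpers defined elsewhere in the codebase.
async function estimateFromVoice(audioBlob, userId) {
  const transcript = await deepgram.transcribe(audioBlob, {
    model: 'nova-2-general',
    custom_search_vocabulary: ['lintel', 'soffit', 'fascia'],
    language: 'en'
  });

  // Guard against an empty result (silence, or a failed request)
  const best = transcript.results?.[0]?.alternatives?.[0];
  if (!best) {
    return { transcript: '', confidence: 0, editable: true };
  }

  // Post-process: normalize units, then match terms against the item DB
  let corrected = applyConstructionRegex(best.transcript);
  corrected = fuzzyMatchAgainstItemDB(corrected, userId);

  // Show draft to user for review
  return {
    transcript: corrected,
    confidence: best.confidence,
    editable: true
  };
}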
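The two post-processing passes called in the workflow above could look roughly like this. It is a sketch under assumptions: `normalizeUnits` and `fuzzyCorrect` are hypothetical stand-ins for `applyConstructionRegex` and `fuzzyMatchAgainstItemDB`, the catalog is passed in as a plain array rather than looked up per user, and the fuzzy pass only handles ASCII words.

```javascript
// Pass 1: standardize dimension and quantity phrases.
function normalizeUnits(text) {
  const unit = (u) => (u ? (u.toLowerCase() === 'm' ? 'm' : 'ft') : '');
  const dims =
    /\b(\d+(?:\.\d+)?)(?:\s*(foot|feet|ft|m))?\s*(?:by|x)\s*(\d+(?:\.\d+)?)(?:\s*(foot|feet|ft|m)\b)?/gi;
  return text
    .replace(dims, (_, a, u1, b, u2) => {
      // If only one side carries a unit, assume it applies to both.
      const first = unit(u1) || unit(u2);
      const second = unit(u2) || unit(u1);
      return `${a}${first} x ${b}${second}`;
    })
    .replace(/\b(?:qty|quantity)\s*(\d+)\b/gi, 'qty $1');
}

// Classic single-row Levenshtein edit distance between two strings.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        above + 1, // deletion
        row[j - 1] + 1, // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      diag = above;
    }
  }
  return row[b.length];
}

// Pass 2: snap words of 4+ letters to the closest catalog term, if close enough.
function fuzzyCorrect(text, catalog, maxDistance = 1) {
  return text.replace(/[a-z]{4,}/gi, (word) => {
    let best = null;
    let bestDist = maxDistance + 1;
    for (const term of catalog) {
      const d = levenshtein(word.toLowerCase(), term.toLowerCase());
      if (d < bestDist) {
        best = term;
        bestDist = d;
      }
    }
    return best ?? word;
  });
}
```

A small `maxDistance` keeps the fuzzy pass conservative: it fixes "sofit" but leaves ordinary words alone, which matters because every false correction costs the user a tap.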
3. Integration & UX Patterns That Work
Voice estimating isn't "speech-to-text." It's speech-to-structured-data.
A user says: "Two lintel windows, rough opening 3 by 4, treated lumber, labor install, and paint trim."
Your system must parse this into a line item:
- Item: Window lintel
- Quantity: 2
- Dimensions: 3'×4'
- Material: Treated lumber
- Labor: Install + Paint trim
- Unit price: (lookup from DB)
- Total: (calculate)
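To make the target structure concrete, here is one way the parsed slots might be turned into a priced line item. Everything in it is illustrative: the `PRICE_DB` shape, the prices, and the assumption that labor is charged per unit are placeholders, not Anodos's pricing model.

```javascript
// Hypothetical price list, in cents to avoid floating-point drift.
const PRICE_DB = {
  items: { 'window lintel': 8500 },
  labor: { install: 4000, 'paint trim': 1500 },
};

// Build a priced line item from extracted slots.
// Assumes labor is charged per unit, which may not match your pricing.
function buildLineItem(slots, db = PRICE_DB) {
  const unitPriceCents = db.items[slots.item.toLowerCase()];
  if (unitPriceCents === undefined) {
    throw new Error(`No price for item "${slots.item}"`);
  }
  const laborPerUnitCents = slots.labor.reduce((sum, task) => {
    const rate = db.labor[task.toLowerCase()];
    if (rate === undefined) throw new Error(`No rate for labor "${task}"`);
    return sum + rate;
  }, 0);
  return {
    ...slots,
    unitPriceCents,
    laborPerUnitCents,
    totalCents: slots.qty * (unitPriceCents + laborPerUnitCents),
  };
}
```

Throwing on an unknown item or labor type, rather than defaulting to zero, keeps a misheard term from silently producing an underpriced quote.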
The pattern that works:
Chunked audio: Capture 5–10 second chunks of speech. On each chunk, stream to Deepgram (streaming API, not batch). Display interim results in real time.
Slot filling, not free text: As the transcript grows, use a lightweight NLU (we use Rasa, but GPT-3.5 also works) to extract slots: {item, qty, dimensions, material, labor_type}. If a slot is missing, prompt the user: "How many units?"
Confirmation before commit: Never auto-add to the estimate. Always show "Add 2 × 3×4 window lintel (treated) + labor (install, paint)?" with a 2-second timeout. User taps ✓ or says "no, delete."
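The slot-filling and confirmation steps can be sketched independently of the NLU engine. In this sketch the slot names come from the list above, while the prompt wording (other than "How many units?"), the function names, and the merge strategy are assumptions.

```javascript
const REQUIRED_SLOTS = ['item', 'qty', 'dimensions', 'material', 'labor_type'];

// Hypothetical follow-up prompts, one per slot.
const PROMPTS = {
  item: 'Which item?',
  qty: 'How many units?',
  dimensions: 'What dimensions?',
  material: 'Which material?',
  labor_type: 'What labor is included?',
};

const isMissing = (v) =>
  v == null || v === '' || (Array.isArray(v) && v.length === 0);

// Merge newly extracted slots into the running state; later values win.
function mergeSlots(current, extracted) {
  const merged = { ...current };
  for (const [name, value] of Object.entries(extracted)) {
    if (!isMissing(value)) merged[name] = value;
  }
  return merged;
}

// Return the prompt for the first missing slot, or null when complete.
function nextPrompt(slots) {
  const missing = REQUIRED_SLOTS.find((name) => isMissing(slots[name]));
  return missing ? PROMPTS[missing] : null;
}

// Build the confirmation text shown before anything is committed.
function confirmationText(slots) {
  const labor = slots.labor_type.join(', ');
  return `Add ${slots.qty} × ${slots.dimensions} ${slots.item} (${slots.material}) + labor (${labor})?`;
}
```

Because `mergeSlots` lets later values overwrite earlier ones, a spoken self-correction that the NLU extracts as a new `dimensions` value replaces the old one without special handling.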
4. Metrics That Matter
After shipping this to 50+ jobsites, here's what we measure:
- Transcription accuracy: 89% word accuracy on-site (roughly 11% word error rate, WER), down from 95% in the lab because of ambient noise (saws, generators). Acceptable for construction: users correct about 1 in 10 items.
- Latency (P95): 680ms from end of speech to final transcript. Users feel it's instant. Anything >1000ms feels laggy.
- Adoption rate: 67% of estimators use voice at least once per week after 3 weeks. Full adoption (>5 times/week) takes 6–8 weeks.
- Time savings: Estimate creation drops from 12 min (typed) to 4–5 min (voice). That is roughly 2.4–3x faster, which is meaningful on a 50-item quote.
- Error rate (post-voice): Slight uptick in line-item errors (1–2% more than typed) because users rush corrections. Mitigated by a stronger confirmation flow.
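For readers who want to reproduce the first two metrics, here is a small sketch of how word error rate and P95 latency are commonly computed. WER is the word-level edit distance divided by the number of reference words; the percentile uses the nearest-rank method. The function names are illustrative, and this is not necessarily how the numbers above were produced.

```javascript
// Word-level edit distance (substitutions, insertions, deletions).
function wordEditDistance(ref, hyp) {
  const row = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    let diag = row[0];
    row[0] = i;
    for (let j = 1; j <= hyp.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        above + 1,
        row[j - 1] + 1,
        diag + (ref[i - 1] === hyp[j - 1] ? 0 : 1)
      );
      diag = above;
    }
  }
  return row[hyp.length];
}

// WER = edit distance / number of words in the reference transcript.
function wordErrorRate(reference, hypothesis) {
  const words = (s) => s.toLowerCase().split(/\s+/).filter(Boolean);
  const ref = words(reference);
  if (ref.length === 0) throw new Error('Reference transcript is empty');
  return wordEditDistance(ref, words(hypothesis)) / ref.length;
}

// Nearest-rank percentile, e.g. percentile(latenciesMs, 95) for P95.
function percentile(values, p) {
  if (values.length === 0) throw new Error('No values');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

Word accuracy is then `1 - wordErrorRate(...)`, so one wrong word in a five-word utterance gives a WER of 0.2 and an accuracy of 80%.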
5. Lessons & Gotchas
Lesson 1: Don't assume your speech engine is "smart." It's not. A $5 regex beats a $50k ML model on domain-specific problems. Combine both.
Lesson 2: Network resilience is non-negotiable. Jobsites are offline by default. Build offline-first, sync on reconnect.
Lesson 3: Users will speak in sentence fragments, slang, and code-switches. ("3 by 2, no wait, 3 by 4, same as the last one.") Train your slot-filling on real user utterances, not boardroom examples.
Lesson 4: Voice isn't faster for all workflows. It shines for list-building (adding line items). It's slower for editing (change qty in item 7). Hybrid UI wins: voice for input, tap-to-edit for refinement.
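The offline-first advice in Lesson 2 boils down to a queue that holds estimates locally and drains when the network comes back. The sketch below keeps the queue in memory and takes the upload function as a parameter; both are simplifying assumptions, and a real app would persist the queue (for example in IndexedDB) so it survives a reload.

```javascript
// Minimal offline-first queue: estimates wait locally until `send` succeeds.
// `send` is a hypothetical async upload function supplied by the caller.
class OfflineQueue {
  constructor(send) {
    this.send = send;
    this.pending = [];
  }

  enqueue(estimate) {
    this.pending.push(estimate);
  }

  // Try to upload in order; stop at the first failure and keep the rest.
  // Returns the number of estimates that were sent.
  async flush() {
    let sent = 0;
    while (this.pending.length > 0) {
      try {
        await this.send(this.pending[0]);
      } catch {
        break; // still offline or server error: retry on the next flush
      }
      this.pending.shift();
      sent++;
    }
    return sent;
  }
}

// In a browser, flushing on reconnect could be wired up like this:
// window.addEventListener('online', () => queue.flush());
```

An estimate is removed from the queue only after its upload resolves, so a connection that drops mid-flush never loses data; the trade-off is that the server should tolerate an occasional duplicate.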
Takeaway
Voice AI on jobsites works because it solves a real UX constraint—you can't type outdoors. The technical bar is high (accuracy, latency, offline fallback), but the business payoff is clear: estimators spend less time at a desk, more time on-site closing jobs.
If you're building SaaS for field workers, voice should be on your roadmap. Start with a Deepgram + Rasa + offline Whisper stack, measure adoption weekly, and iterate on your slot-extraction logic. The magic is in the domain work, not the AI.
Olivier Ebrahim is founder of Anodos, a construction SaaS for French SMBs building voice-first jobsite estimating, real-time crew coordination, and Factur-X 2026 compliance tools.