Olivier EBRAHIM
Voice AI for Jobsite Estimating: A Developer Perspective

Building estimators spend hours hunched over spreadsheets, struggling with poor handwriting on site photos, and entering the same data twice (once on paper, once in the office). This workflow is broken. Voice AI changes everything—and it's simpler to implement than most developers think.

In this article, I'll walk you through the real-world lessons we learned deploying voice-to-estimate features in a production SaaS for French construction SMBs. This isn't hype; it's practical architecture.

The Problem: Why Voice Matters on a Jobsite

A construction foreman needs to create an estimate for concrete repairs. Current flow:

  1. Walk the site with a clipboard and pen (messy, imprecise)
  2. Return to the office
  3. Type notes into Excel or your estimating software
  4. Cross-reference material prices from supplier catalogs
  5. Pray nothing was misheard or miswritten

Each step compounds error. Voice AI collapses steps 1–3 into 30 seconds.

Why not text input? Jobsite conditions: wet hands, heavy gloves, dusty screens, poor signal. A foreman can't type. But they can talk. A simple phrase like "Redouter trois mètres carrés de ciment dégradé" (three square meters of degraded cement) becomes:

  • Automatically recognized and categorized
  • Linked to unit costs from your database
  • Inserted into an estimate line-item in real time

The UX is frictionless. The ROI is immediate: fewer re-entries, faster estimates, fewer back-office hours.

Architecture: How We Built It

We're using a stack of standard tools. Nothing exotic.

1. Audio Capture & Streaming (Client-Side)

On iOS (native Swift) or Android (Kotlin), capture raw PCM audio at 16 kHz, 16-bit. Don't try to compress on-device: the encoding delay often outweighs the transfer time saved. Stream raw frames to your backend via WebSocket.

Why WebSocket? Low latency, persistent connection, server can push results back as they arrive.

┌─────────────┐          WebSocket           ┌──────────────┐
│  Jobsite    │◄────────────────────────────►│   Speech     │
│   App       │     Raw PCM 16kHz 16-bit     │  Inference   │
│  (iOS/And)  │                              │   Server     │
└─────────────┘                              └──────────────┘
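
To make the framing concrete, here is a minimal Python sketch of how raw PCM could be cut into fixed-size frames before each one is sent as a binary WebSocket message. The 16 kHz and 16-bit values come from the text above; mono capture and the 20 ms frame length are assumptions, and the transport itself is left out.

```python
SAMPLE_RATE_HZ = 16_000  # from the article: 16 kHz
BYTES_PER_SAMPLE = 2     # from the article: 16-bit PCM
CHANNELS = 1             # assumption: mono capture
FRAME_MS = 20            # assumption: 20 ms per frame


def frame_size_bytes(frame_ms: int = FRAME_MS) -> int:
    """Number of bytes in one frame of raw PCM at the settings above."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * frame_ms // 1000


def split_into_frames(pcm: bytes, frame_ms: int = FRAME_MS):
    """Yield fixed-size frames; a trailing partial frame is yielded as-is."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


if __name__ == "__main__":
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
    frames = list(split_into_frames(one_second))
    print(frame_size_bytes(), len(frames))
```

With these settings one second of audio is 32,000 bytes, which is why stream length matters on a weak cellular link.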

Pro tip: On iOS, Apple's Speech framework is a free alternative that can recognize speech on-device for supported languages and devices. For Android, streaming to a cloud service (Google Cloud Speech, Azure) is cleaner than bundling a local model.

2. Speech-to-Text (STT) API

Don't build your own speech recognition—it's a solved problem. Choose between:

  • Google Cloud Speech-to-Text: High accuracy for French, context hints, real-time streaming API, priced at a fraction of a cent per 15 seconds of audio (the exact rate depends on model and API version).
  • Azure Speech: Competitive pricing, similar quality.
  • OpenAI Whisper: Open-weights model you can run on-prem and fine-tune for domain vocabulary.

For construction vocabulary (béton, devis, chantier, etc.), you'll want to configure context hints. Both Google and Azure let you pass a custom phrase list at request time.
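
As a rough illustration of request-time hints, the sketch below builds the JSON `config` object for a Google Cloud Speech-to-Text v1 REST request. The field names (`encoding`, `sampleRateHertz`, `languageCode`, `speechContexts`, `phrases`) are the v1 `RecognitionConfig` fields as I understand them, so check them against the current reference before relying on this. The phrase list and the helper name `build_recognition_config` are illustrative, not taken from our codebase.

```python
import json

# Illustrative domain vocabulary; extend with your own catalog terms.
CONSTRUCTION_PHRASES = ["béton", "devis", "chantier", "maçonnerie"]


def build_recognition_config(phrases, language_code="fr-FR"):
    """Return a RecognitionConfig-style dict carrying phrase hints."""
    return {
        "encoding": "LINEAR16",      # raw 16-bit PCM
        "sampleRateHertz": 16000,    # matches the capture settings
        "languageCode": language_code,
        "speechContexts": [{"phrases": list(phrases)}],
    }


if __name__ == "__main__":
    config = build_recognition_config(CONSTRUCTION_PHRASES)
    print(json.dumps(config, ensure_ascii=False, indent=2))
```

Azure exposes the same idea through phrase lists in its Speech SDK; the payload shape differs, so this dict is not portable as-is.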

Cost reality: At 50 estimates/day per user and 100 customers, we budget ~$150/month for speech; the exact figure depends on utterance length and active users per customer. Negligible.

3. NLU: Entity Extraction & Classification

Raw transcription is just text. You need to extract:

  • Materials ("béton" → concrete, unit: m²)
  • Quantities (3, 5.5)
  • Adjectives / conditions ("dégradé" → damaged, labor multiplier +15%)
  • Labor ("deux jours de main-d'œuvre" → 2 labor days)

Don't use regex. Use a lightweight NLU model. Options:

  • Rasa: Open-source, trainable on your domain, Python, ~50 MB footprint.
  • spaCy + custom classifiers: Lightweight, fast.
  • Claude (via API): Overkill but works; slower for real-time.

We chose Rasa. Training data: 500 example phrases covering about 80% of the speech patterns we hear on real jobsites. Time to first model: 3 days. Accuracy: 94% after 2 weeks in production.

Input: "Trois mètres carrés de béton dégradé à enlever"
Output: {
  "material": "concrete",
  "quantity": 3,
  "unit": "m²",
  "condition": "damaged",
  "action": "removal",
  "labor_hours": 0.5
}
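
Whatever NLU engine produces that output, it helps to validate it before pricing. The following is a minimal sketch, assuming the NLU layer hands back a plain dict shaped like the one above; the `LineEntities` class and `parse_entities` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED_KEYS = ("material", "quantity", "unit")


@dataclass(frozen=True)
class LineEntities:
    """Typed view of one extracted estimate line."""
    material: str
    quantity: float
    unit: str
    condition: Optional[str] = None
    action: Optional[str] = None
    labor_hours: float = 0.0


def parse_entities(raw: dict) -> LineEntities:
    """Validate the NLU dict and convert it to LineEntities."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"missing required entities: {missing}")
    quantity = float(raw["quantity"])
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return LineEntities(
        material=str(raw["material"]),
        quantity=quantity,
        unit=str(raw["unit"]),
        condition=raw.get("condition"),
        action=raw.get("action"),
        labor_hours=float(raw.get("labor_hours", 0.0)),
    )


if __name__ == "__main__":
    print(parse_entities({
        "material": "concrete", "quantity": 3, "unit": "m²",
        "condition": "damaged", "action": "removal", "labor_hours": 0.5,
    }))
```

Rejecting incomplete extractions here keeps bad lines out of the pricing step instead of surfacing them as odd totals later.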

4. Estimate Line Item Generation

Once entities are extracted, join against your product/material database:

  • Material → unit cost (from your supplier integrations or manual catalog)
  • Quantity × unit cost = line item total
  • Condition adjustment (damaged concrete = +15% labor) → apply multiplier
  • Auto-populate labor hours if provided
  • Insert line into the live estimate on the jobsite app
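
The steps above can be sketched as a small pricing function. The catalog price and the hourly labor rate below are made-up placeholders; only the +15% labor adjustment for damaged material comes from this article. `Decimal` is used so that currency totals do not pick up floating-point noise.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical catalog and rates, for illustration only.
UNIT_COSTS = {"concrete": Decimal("40.00")}  # EUR per m²
LABOR_RATE = Decimal("50.00")                # EUR per hour
# From the article: damaged material adds 15% to labor.
CONDITION_LABOR_MULTIPLIER = {"damaged": Decimal("1.15")}

CENT = Decimal("0.01")


def price_line(material, quantity, labor_hours=0, condition=None):
    """Return material, labor, and total amounts for one estimate line."""
    unit_cost = UNIT_COSTS[material]  # KeyError: material not in catalog
    material_total = unit_cost * Decimal(str(quantity))
    multiplier = CONDITION_LABOR_MULTIPLIER.get(condition, Decimal("1"))
    labor_total = Decimal(str(labor_hours)) * LABOR_RATE * multiplier
    total = material_total + labor_total
    return {
        "material_total": material_total.quantize(CENT, ROUND_HALF_UP),
        "labor_total": labor_total.quantize(CENT, ROUND_HALF_UP),
        "total": total.quantize(CENT, ROUND_HALF_UP),
    }


if __name__ == "__main__":
    print(price_line("concrete", 3, labor_hours=0.5, condition="damaged"))
```

In a real system the two dictionaries would be database lookups, and an unknown material should be routed to human review rather than raising.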

The lookup and line generation take <200 ms. The user sees their speech transcribed and then appear as a complete line item. Zero context switching.

5. Quality Gates & Human Review

Never auto-commit an estimate to final status. Every voice-generated line item starts as a draft suggestion with confidence scores:

  • ≥95% confidence: accept as a draft line, show to foreman for review
  • 75% to <95%: flag for human review (estimator or supervisor)
  • <75%: reject, ask user to repeat

In production, ~88% of lines are ≥95% confidence (French is well covered by the major STT models). The foreman can edit, delete, or approve in the app before final submission.
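
The thresholds translate into a simple triage function. This is a sketch assuming confidence arrives as a float between 0 and 1; the status names are illustrative. Every outcome remains a draft until the foreman approves it.

```python
def triage(confidence: float) -> str:
    """Map a confidence score in [0, 1] to a review status."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= 0.95:
        return "draft_accepted"  # still shown to the foreman for review
    if confidence >= 0.75:
        return "needs_review"    # estimator or supervisor checks it
    return "ask_repeat"          # discard and prompt the user again


if __name__ == "__main__":
    for score in (0.99, 0.80, 0.40):
        print(score, triage(score))
```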

Real-World Lessons

1. Silence is a Feature

Users expect the app to know when they're done speaking. Implement a silence threshold: if >1 second of silence is detected, treat it as sentence-end and trigger NLU. Don't wait for manual "Done" buttons—it kills UX.
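
One simple way to implement this is energy-based endpointing: compute the RMS amplitude of each frame and end the utterance once quiet frames have accumulated for more than one second after some speech was heard. The sketch below assumes 16-bit little-endian mono PCM in 20 ms frames; the RMS threshold of 500 is an arbitrary starting point that needs tuning per device and noise level.

```python
import math
import struct

FRAME_MS = 20      # assumption: 20 ms frames
SILENCE_RMS = 500  # assumption: tune per device and site noise
SILENCE_MS = 1000  # from the article: more than 1 second of silence


def frame_rms(frame: bytes) -> float:
    """RMS amplitude of one frame of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def utterance_ended(frames, frame_ms: int = FRAME_MS) -> bool:
    """True if speech was heard and trailing silence exceeds SILENCE_MS."""
    heard_speech = False
    quiet_ms = 0
    for frame in frames:
        if frame_rms(frame) >= SILENCE_RMS:
            heard_speech = True
            quiet_ms = 0           # any loud frame resets the timer
        else:
            quiet_ms += frame_ms
    return heard_speech and quiet_ms > SILENCE_MS


if __name__ == "__main__":
    loud = struct.pack("<320h", *([4000] * 320))
    quiet = bytes(640)
    print(utterance_ended([loud] * 10 + [quiet] * 51))
```

A fixed energy threshold struggles next to a running concrete saw; a trained voice activity detector is the usual upgrade once this baseline stops being good enough.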

2. Domain Vocabulary is Crucial

Generic speech models mis-transcribe construction jargon. "Maçonnerie" becomes "ma-connerie" (an unfortunate pun). Always adapt the model with domain examples, through context hints or fine-tuning.

3. Offline Fallback

Jobsite connectivity is unstable. If STT fails mid-phrase, gracefully degrade to manual text input. Don't force users to re-speak.
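
A minimal sketch of that fallback, assuming the transcriber raises an exception that carries whatever text was recognized before the connection dropped. `SttUnavailable`, `capture_line`, and both callbacks are hypothetical names; the point is that the partial transcript pre-fills the manual field.

```python
from typing import Callable, Tuple


class SttUnavailable(Exception):
    """Hypothetical error raised when speech-to-text drops mid-phrase."""

    def __init__(self, partial_text: str = ""):
        super().__init__("speech-to-text unavailable")
        self.partial_text = partial_text


def capture_line(
    transcribe: Callable[[], str],
    ask_manual: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is 'voice' or 'manual'."""
    try:
        return transcribe(), "voice"
    except SttUnavailable as err:
        # Pre-fill manual input with what was already recognized,
        # so the user only completes the phrase instead of repeating it.
        return ask_manual(err.partial_text), "manual"


if __name__ == "__main__":
    def failing_transcribe() -> str:
        raise SttUnavailable("Trois mètres")

    print(capture_line(failing_transcribe, lambda prefill: prefill + " carrés"))
```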

4. Cost Optimization: End the Stream Early on Silence

Streaming raw 16 kHz, 16-bit audio over 4G burns data (about 32 KB per second for mono). Once the first few kilobytes have arrived, detect silence on the server side and close the stream early. Saves ~40% bandwidth.

5. Regulatory: Factur-X 2026 Integration

In France, B2B e-invoicing becomes mandatory in phases starting in 2026, and Factur-X is one of the accepted invoice formats. An estimate is not an invoice, but any estimate that feeds a B2B or B2G invoice should map cleanly onto it. Build your estimate-to-invoice pipeline with Factur-X in mind from day one. It's not hard (a PDF/A-3 file with structured XML embedded), but retrofitting it is painful.

Putting It Together: Real Example

A foreman at a concrete cutting job uses Anodos—a construction SaaS that integrated voice estimating:

  1. 11:35 AM, on-site: "Découpe béton, trois mètres linéaires, profondeur dix centimètres"
  2. STT output: "Découpe béton trois mètres linéaires profondeur dix centimètres"
  3. NLU: material=concrete_cutting, quantity=3, unit=linear_meters, depth=0.1m
  4. Database join: concrete_cutting @ €45/m + depth premium = €52/m
  5. Line item: "Concrete cutting, 3 LM × €52 = €156" → auto-inserted into estimate
  6. Foreman reviews: taps ✓ to accept
  7. 11:36 AM: Estimate ready to send to the client. No office time needed.

Before voice AI, this would take 20 minutes of back-and-forth.
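
The arithmetic in steps 4 and 5 can be checked with a few lines of Python. The €45/m base rate and the €52/m final rate come from the example; the €7/m depth premium is inferred as the difference between them, and `cutting_line` is an illustrative helper rather than production code.

```python
from decimal import Decimal

BASE_RATE = Decimal("45")     # EUR per linear meter, from the example
DEPTH_PREMIUM = Decimal("7")  # EUR per linear meter, inferred as 52 - 45


def cutting_line(quantity_lm: int) -> str:
    """Format the estimate line for a concrete-cutting job."""
    rate = BASE_RATE + DEPTH_PREMIUM
    total = rate * quantity_lm
    return f"Concrete cutting, {quantity_lm} LM × €{rate} = €{total}"


if __name__ == "__main__":
    print(cutting_line(3))
```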

Implementation Timeline

Week 1: Integrate a cloud STT API (Google or Azure), WebSocket streaming from mobile app.
Week 2: Build basic NLU extractor using Rasa or spaCy, connect to material database.
Week 3: Test with 5–10 real users, collect speech samples, fine-tune model.
Week 4: Deploy to production, monitor error rates, iterate on false positives.

Total: 4 weeks. Team: 1 backend engineer, 1 mobile engineer, 1 PM for training data curation.

Pitfalls to Avoid

  • Don't ship without domain fine-tuning. A generic speech model will fail on French jobsite vocab about 15% of the time.
  • Don't assume a stable link when streaming raw audio over poor cellular. Implement jitter buffers and graceful degradation.
  • Don't skip human review for high-value estimates. Always let the user approve before submission, even if confidence is 99%.
  • Don't ignore privacy. Audio streams contain jobsite details and worker voices. Encrypt in transit, delete server-side after transcription, comply with GDPR.

Conclusion

Voice AI for construction estimating is not a gimmick. It's a 10x improvement on paper-based workflows. For developers, it's accessible: combine a cloud STT API, a lightweight NLU model, and a solid mobile UI. Expect 4 weeks to MVP, 12 weeks to polished production.

The foreman wins. The office admin wins. The business wins. And you get to deploy something that feels like magic to your users.

Ready to build? Start with a simple Rasa model and Google Cloud Speech. Ship early, gather real speech data, iterate fast.


About the Author

Olivier Ebrahim is the founder of Anodos, a construction SaaS that brings voice-first estimating, real-time jobsite coordination, and Factur-X 2026 compliance to French SMB builders. He's spent the last 18 months deploying voice AI on 50+ jobsites across France and learned every hard lesson in this article the painful way.
