Olivier EBRAHIM

Posted on May 23

Voice AI for jobsite estimating: a developer perspective

#construction #ai #saas #webdev


Construction estimation is broken. For decades, field teams have relied on photos, spreadsheets, and handwritten notes—then someone back at the office transcribes everything into a formal quote. It's slow, error-prone, and treats the person holding the clipboard as a data-entry machine rather than an expert.

What if your estimator could speak the estimate into existence?

The Problem We Set Out to Solve

At Anodos, we work with small construction firms—electricians, plumbers, general contractors, framers. These aren't enterprise customers. They're 5-50 person crews who need solutions that fit in a backpack, not a server room.

Last year, we discovered a pattern: field teams spent 35-40% of their on-site time taking photos and notes for later transcription. Not analyzing, not planning—just capturing data. The actual estimation (the valuable part) happened hours or days later, when the momentum was gone and details had faded.

The obvious solution? Make the estimator the interface. Let them dictate directly into the system while walking the jobsite.

Why Voice Interface for Construction?

Three reasons made this obviously right:

1. Hands stay free. You're on a scaffold, holding a level in one hand and a flashlight in the other. A clipboard or phone screen doesn't fit. Your voice does.

2. Spatial context is immediate. When you say "2x4 studs at 16 inches on center across the east wall," your mental model is there, not reconstructed later. The accuracy jump is measurable—we saw 15-18% fewer revision quotes after moving to voice capture.

3. Language is how experts think. A mason doesn't think in form fields; they think in terms of "this wall needs block, that corner needs brick ties, this opening is non-standard." Voice lets them stay in their native language of expertise.

The Technical Approach

We chose a hybrid architecture: local speech-to-text for speed, cloud-based LLM for semantic parsing.

Why Not Just Transcribe?

The obvious approach is to capture voice → transcribe to text → done. But raw transcription is ~93-95% accurate at best, and that 5-7% error rate in construction numbers is catastrophic. ("Sixteen" vs. "sixty," "studs" vs. "stools"—yeah.)

Instead, we:

1. Capture audio locally (Apple Speech Framework on iOS, Web Speech API + fallback on Android).
2. Stream to a lightweight STT model (we tested Whisper, built fallback for offline mode).
3. Send structured chunks to Claude with construction context: "User said [transcript]. They're estimating a wall. Extract: material type, length, height, special notes. Return JSON."

The LLM step adds 300-800 ms of latency, but because it understands context it brings parsing accuracy to roughly 99.2%. ("Sixteen" in a wall context is almost certainly 16 inches, not 60.)
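The parse step can be sketched roughly as follows. This is a minimal illustration rather than our production code: `build_prompt`, `parse_wall_response`, and the `REQUIRED_FIELDS` list are hypothetical names, and the call to the LLM itself is left out. The sketch only shows how the prompt might be assembled and how the returned JSON might be checked before it reaches a quote.

```python
import json

# Hypothetical field list, derived from the prompt wording in this article.
REQUIRED_FIELDS = ("material", "length_ft", "height_ft")


def build_prompt(transcript: str, element: str = "wall") -> str:
    """Assemble the instruction that accompanies the raw transcript."""
    return (
        f"User said: {transcript!r}. They're estimating a {element}. "
        "Extract: material type, length, height, special notes. "
        "Return JSON with keys material, length_ft, height_ft, notes."
    )


def parse_wall_response(raw: str) -> dict:
    """Parse the LLM reply and reject anything unusable for a quote."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for key in ("length_ft", "height_ft"):
        value = data[key]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value <= 0:
            raise ValueError(f"{key} must be a positive number, got {value!r}")
    return data
```

Rejecting incomplete or non-numeric replies at this boundary means a bad parse becomes a re-speak prompt for the estimator instead of a wrong number in a quote.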

Example Flow

Field: "East wall, brick, 24 feet long, 8 feet tall, 3 windows 3-by-4, mortar bed with ties"

STT: "east wall brick 24 feet long 8 feet tall 3 windows 3 by 4 mortar bed with ties"

LLM Parse: {
  "wall": "east",
  "material": "brick",
  "dimensions": { "length_ft": 24, "height_ft": 8 },
  "openings": [
    {"type": "window", "count": 3, "width_ft": 3, "height_ft": 4}
  ],
  "finish": "mortar bed with ties"
}

Quote Line Item: "Brick wall, 24'×8' with 3×(3'×4') window openings, mortar bed + ties → $3,840 labor + materials"

The JSON is then fed into a quote template specific to their trade. Framers, masons, electricians—all get their own context dictionaries.
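As a rough sketch of that last step, the parsed JSON from the example flow can be turned into a line-item description and a net wall area. The function names here are hypothetical, and pricing is left out on purpose because the rates live in the trade-specific templates, which this article does not detail.

```python
def net_wall_area_sqft(parsed: dict) -> float:
    """Gross wall area minus the area of all openings, in square feet."""
    dims = parsed["dimensions"]
    gross = dims["length_ft"] * dims["height_ft"]
    openings = sum(
        o["count"] * o["width_ft"] * o["height_ft"]
        for o in parsed.get("openings", [])
    )
    return gross - openings


def describe_line_item(parsed: dict) -> str:
    """Build the human-readable part of a quote line from the parsed JSON."""
    dims = parsed["dimensions"]
    text = f"{parsed['material'].capitalize()} wall, {dims['length_ft']}'x{dims['height_ft']}'"
    openings = [
        f"{o['count']}x({o['width_ft']}'x{o['height_ft']}') {o['type']} openings"
        for o in parsed.get("openings", [])
    ]
    if openings:
        text += " with " + ", ".join(openings)
    if parsed.get("finish"):
        text += f", {parsed['finish']}"
    return text


example = {
    "wall": "east",
    "material": "brick",
    "dimensions": {"length_ft": 24, "height_ft": 8},
    "openings": [{"type": "window", "count": 3, "width_ft": 3, "height_ft": 4}],
    "finish": "mortar bed with ties",
}
```

For the example above, the gross area is 24 x 8 = 192 sq ft and the three windows remove 3 x 3 x 4 = 36 sq ft, so `net_wall_area_sqft(example)` gives 156.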

Lessons from 50+ Jobsites

We've been testing this live with real estimators for six months. Here's what we learned:

1. Fine-Tuning the Model Matters More Than You'd Think

Generic LLMs hallucinate construction details. (Say "studs" and the model tries to be helpful by inferring a depth we never mentioned.) We fine-tuned on 400 real jobsite estimates from our users, and the error rate dropped from 4.1% to 0.8%. That's the difference between a tool people trust and a novelty.

2. Latency is a Feature, Not Just a Metric

If the user has to wait >2 seconds for confirmation that their input was parsed correctly, they revert to typing or paper. We engineered aggressive caching: common materials, standard dimensions, repeated locations. It feels instant now. At 5+ seconds, adoption drops 65%.

3. Dictation Style Varies by Trade

Electricians speak in circuits and breaker counts. Framers think in board feet and rough-opening dimensions. Masons count blocks and courses. We had to build separate "language models" (really: context dictionaries + parsing rules) for each trade. One prompt doesn't fit all.
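A context dictionary can be as plain as a per-trade mapping of known mishearings and expected fields. The sketch below is an assumption about shape, not our real data: `TRADE_CONTEXT`, its field names, and most of the corrections are hypothetical (only the "stools" to "studs" mix-up comes from earlier in this article).

```python
import re

# Hypothetical per-trade dictionaries: STT mishearings to fix, fields to extract.
TRADE_CONTEXT = {
    "framing": {
        "corrections": {"stools": "studs"},
        "fields": ["member_size", "spacing_in", "wall"],
    },
    "masonry": {
        "corrections": {"curses": "courses"},
        "fields": ["material", "length_ft", "height_ft", "courses"],
    },
    "electrical": {
        "corrections": {"conduct": "conduit"},
        "fields": ["circuits", "breaker_count"],
    },
}


def apply_corrections(transcript: str, trade: str) -> str:
    """Replace known mishearings for one trade, leaving other words untouched."""
    corrections = TRADE_CONTEXT[trade]["corrections"]

    def fix(match: re.Match) -> str:
        word = match.group(0)
        return corrections.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", fix, transcript)
```

Keeping corrections scoped per trade matters: "stools" is a mishearing on a framing job but could be a legitimate word elsewhere, so the same transcript is cleaned differently depending on who is speaking.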

4. Offline Mode is Non-Negotiable

Rural jobsites often have zero connectivity. We built a fallback: capture voice locally, queue it, and sync once a connection is available, typically back at the office. It's less fun (no instant feedback), but it's real. Users love it because they don't have to think about connectivity; the system handles it.
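The queue-then-sync idea can be sketched in a few lines. This is a simplified in-memory illustration with hypothetical names (`CaptureQueue`, `upload`); a real mobile app would persist the queue to disk so captures survive a restart, and the upload function would be whatever client the app already uses.

```python
from collections import deque
from typing import Callable


class CaptureQueue:
    """Holds captured clips in order until they can be uploaded."""

    def __init__(self, upload: Callable[[str], None]) -> None:
        self._pending: deque = deque()
        self._upload = upload  # expected to raise ConnectionError when offline

    def capture(self, clip_id: str) -> None:
        """Always accept the capture, regardless of connectivity."""
        self._pending.append(clip_id)

    def sync(self) -> int:
        """Upload clips oldest first; stop at the first failure. Returns count sent."""
        sent = 0
        while self._pending:
            clip = self._pending[0]
            try:
                self._upload(clip)
            except ConnectionError:
                break  # still offline: keep this clip and everything behind it
            self._pending.popleft()
            sent += 1
        return sent

    def pending_count(self) -> int:
        return len(self._pending)
```

A clip is removed from the queue only after its upload succeeds, so a connection that drops mid-sync loses nothing and the next `sync()` resumes where the last one stopped.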

5. Confidence Scores > Blind Trust

We show estimators a confidence bar on each parsed line item: "high," "medium," "low." If it's low, they re-speak or fix manually. This transparency shifted adoption from "neat demo" to "actually faster than my old method."
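Mapping a numeric score to those three labels is straightforward. In the sketch below the thresholds (0.9 and 0.7) and the `score` field are assumptions made for illustration; this article does not specify how scores are computed or where the cutoffs sit.

```python
def confidence_label(score: float, high: float = 0.9, medium: float = 0.7) -> str:
    """Bucket a 0-1 confidence score into the label shown to the estimator."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


def needs_review(items: list) -> list:
    """Return the line items the estimator should re-speak or fix by hand."""
    return [item for item in items if confidence_label(item["score"]) == "low"]
```

For example, `needs_review([{"name": "east wall", "score": 0.95}, {"name": "north wall", "score": 0.4}])` returns only the north wall entry.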

Developer Takeaways

If you're building voice-powered tools in any domain with domain-specific language:

Don't rely on generic STT alone. Layer semantic understanding (fine-tuned model, context dictionary, or heuristic parser) on top.
Test latency aggressively. Mobile users will abandon you at 2-3 seconds; web users at 5-6.
Build offline gracefully. Queuing + sync is better than "no service" errors.
Instrument confidence early. Users need to know if you're uncertain. Opacity kills trust faster than obvious errors.
Domain language is real. "Stud" to a framer and "stud" in NLP are different things. Build a glossary; invest in fine-tuning.

What's Next

We're rolling this into Anodos as a core feature in Q2 2026. The next phase: multi-user jobsite estimates (multiple estimators capture in parallel, system merges into one coherent quote) and AR preview (see a visual estimate layer on top of the actual jobsite).

The irony: we started building this because field teams were drowning in data capture. Turns out the real innovation isn't technology—it's giving experts permission to stay experts, not become data-entry clerks.

Olivier Ebrahim is the founder of Anodos, a French SaaS platform for construction SMBs that combines jobsite management, AI-powered estimating, and Factur-X 2026 invoicing. Before that, he spent 6 years in construction tech and 3 years as a software architect at Airbus.

DEV Community