Olivier EBRAHIM

Posted on May 6

Voice AI for jobsite estimating: a developer perspective

#construction #ai #saas #webdev

Building sites are chaotic. A site manager needs to estimate material quantities, labor costs, and timelines, often while standing in dust and noise, clipboard in hand. What if they could just speak and have an AI transcribe the request, parse it, and generate a formatted estimate? This is not science fiction: voice-to-estimate pipelines already run in production construction SaaS.

In this article, I'll walk through the engineering challenge of voice-to-estimate pipelines, the real-world gotchas, and how teams are solving them in 2026.

The Pipeline: Audio to Invoice

The naive mental model: user speaks → AI understands → system generates estimate → done.

Reality is messier. Here's the actual architecture we've seen work:

[Audio from jobsite] 
  → Whisper-large-v3 (STT)
  → LLM prompt (extract intent + entities)
  → Rules engine (apply pricing, taxes)
  → Factur-X output (PDF/A-3)
  → Sign & deliver
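The stages above can be sketched as a chain of plain functions. This is a minimal sketch under assumptions, not the production code: the stage callables (`transcribe`, `extract_items`, `price_items`, `render_document`) are hypothetical placeholders supplied by the caller, so each layer can be swapped or tested on its own.

```python
from typing import Any, Callable


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    extract_items: Callable[[str], dict],
    price_items: Callable[[dict], dict],
    render_document: Callable[[dict], bytes],
) -> dict[str, Any]:
    """Run the stages in order and keep every intermediate result."""
    transcript = transcribe(audio)          # speech-to-text
    extraction = extract_items(transcript)  # LLM entity extraction
    estimate = price_items(extraction)      # rules engine
    document = render_document(estimate)    # PDF output
    # Returning the intermediates makes debugging and auditing easier.
    return {
        "transcript": transcript,
        "extraction": extraction,
        "estimate": estimate,
        "document": document,
    }
```

Passing the stages in as arguments keeps the orchestration free of any specific STT or LLM vendor, which helps when you later move a stage from cloud to edge.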

Let's break each layer.

1. Speech-to-Text: Beyond English

Whisper (OpenAI's open-source model) is robust for English. For French construction jargon (terrassement, étanchéité, planelles) you need careful prompting: Whisper accepts an initial text prompt, which you can use to seed domain vocabulary.

The problem: French construction vocabulary is old and technical. Whisper sometimes transcribes "linéaire" (linear meter) as "liinéaire" or spells "chape" (screed) as "chappe".

The solution: Fine-tuning or a simple post-processing layer:

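import re

# Known mis-transcriptions mapped to the correct trade term.
domain_corrections = {
    "liinéaire": "linéaire",
    "béton armer": "béton armé",
    "chappe": "chape",  # screed, often transcribed with a doubled "p"
}

def clean_transcription(text):
    """Replace known mis-transcriptions, matching whole words only."""
    for error, correction in domain_corrections.items():
        # Lookarounds prevent matches inside longer words (e.g. "échappe").
        pattern = rf"(?<!\w){re.escape(error)}(?!\w)"
        text = re.sub(pattern, correction, text, flags=re.IGNORECASE)
    return text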

This buys you ~95% accuracy for trade-specific terms.
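A fixed dictionary only catches misspellings you have already seen. One possible complement is fuzzy matching against a list of known trade terms, sketched here with the standard library's `difflib`. The vocabulary, the 0.85 cutoff, and the function name `fuzzy_correct` are assumptions for illustration, not values from a production system.

```python
import difflib
import re

# Hypothetical vocabulary; in practice it would come from your price list.
DOMAIN_VOCAB = ["linéaire", "chape", "terrassement", "étanchéité", "planelles"]


def fuzzy_correct(text, vocab=DOMAIN_VOCAB, cutoff=0.85):
    """Snap near-miss words onto the closest known trade term."""

    def fix(match):
        word = match.group(0)
        # Leave short words and exact vocabulary hits untouched.
        if len(word) < 5 or word.lower() in vocab:
            return word
        close = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
        return close[0] if close else word

    # \w+ is Unicode-aware in Python 3, so accented letters stay in one token.
    return re.sub(r"\w+", fix, text)


print(fuzzy_correct("12 mètres liinéaire de chappe"))
```

The trade-off is over-correction: a legitimate plural such as "chapes" is close enough to "chape" to be rewritten, so the cutoff needs tuning against real transcripts.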

2. Entity Extraction: The Hard Part

After STT, you need to extract structured fields. For a phrase like "2 tonnes of reinforced concrete for the ground floor":

Material: "concrete"
Quantity: 2
Unit: "tonnes"
Quality level: "reinforced" (affects price)
Location (optional): "ground floor" (affects delivery)
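The target structure for one extracted line can be written down as a small typed record. This is a sketch: the class name `LineItem` and its field names are assumptions chosen to mirror the fields listed above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LineItem:
    material: str
    quantity: float
    unit: str
    quality: str
    location: Optional[str] = None  # optional, only affects delivery

    def __post_init__(self):
        # Reject obviously broken extractions early.
        if self.quantity <= 0:
            raise ValueError("quantity must be positive")


# "2 tonnes of reinforced concrete" on the ground floor
item = LineItem("concrete", 2.0, "tonnes", "reinforced", "ground floor")
print(asdict(item))
```

Validating at construction time means a bad extraction fails loudly before it ever reaches pricing.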

A production system uses a small LLM (Mistral 7B or Claude 3 Haiku) with a structured prompt:

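import json

# Doubled braces ({{ }}) produce literal braces inside an f-string.
prompt = f"""
You are a construction estimator. Parse this voice transcript
and extract line items as JSON.

Transcript: "{transcript}"

Output JSON schema:
{{
  "items": [
    {{
      "material": "string",
      "quantity": float,
      "unit": "string (m, m2, m3, tonnes, pieces, hours)",
      "quality_tier": "economy|standard|premium",
      "notes": "string"
    }}
  ],
  "confidence": 0.0-1.0
}}

Respond with ONLY valid JSON. Always set the confidence field. If it is
below 0.7, also include a clarification request in notes.
"""

# `llm` stands for whichever client you use; generate() should return raw text.
response = llm.generate(prompt)
result = json.loads(response)  # raises json.JSONDecodeError on malformed output
items = result["items"]
confidence = result["confidence"]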

Critical insight: Don't aim for 100% automation. Flag low-confidence extractions (< 0.7) and route them to a human reviewer. In production, 85-90% of voice inputs pass auto-approval; the rest get a 30-second manual review.
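The review routing described above might look like the following. It is a sketch, assuming the LLM returns the JSON shape from the prompt. The function name `route_extraction`, the `REQUIRED_FIELDS` tuple, and the returned dictionary layout are illustrative choices, and only the 0.7 threshold comes from the text.

```python
import json

REVIEW_THRESHOLD = 0.7  # below this, a human looks at the estimate
REQUIRED_FIELDS = ("material", "quantity", "unit")


def route_extraction(raw_response, threshold=REVIEW_THRESHOLD):
    """Decide whether an LLM extraction is auto-approved or sent to review."""

    def review(reason, items=()):
        return {"route": "review", "items": list(items), "reason": reason}

    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return review("invalid JSON")
    if not isinstance(data, dict):
        return review("unexpected JSON shape")

    items = data.get("items")
    confidence = data.get("confidence")
    if not isinstance(items, list) or not items:
        return review("no line items")
    if not isinstance(confidence, (int, float)):
        return review("missing confidence", items)
    for item in items:
        incomplete = not isinstance(item, dict) or any(
            item.get(field) in (None, "") for field in REQUIRED_FIELDS
        )
        if incomplete:
            return review("incomplete line item", items)
    if confidence < threshold:
        return review("low confidence", items)
    return {"route": "auto", "items": items, "reason": ""}
```

Anything that fails to parse goes to a human as well, since "ONLY valid JSON" is a request to the model rather than a guarantee.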

3. Pricing & Rules Engine

Now you have structured line items. You need to apply:

Regional labor rates (in France, shaped by regional collective agreements)
Material surcharges (fuel, import duties, logistics)
VAT & local taxes
Markup rules (client type: public/private affects VAT reclaim)

This is NOT generic. French BTP has Factur-X compliance requirements, URSSAF social contributions, and regional collective agreements. A generic SaaS trying to handle "all European construction" will fail.

class EstimateEngine:
    # The load_* and lookup helpers are left for you to implement.
    def __init__(self, region="Île-de-France", client_type="private"):
        self.labor_rates = self.load_ccmi_rates(region)
        self.tax_rules = self.load_vat_rules(client_type)

    def price_item(self, item):
        # Assumes get_material_cost returns a unit cost, so scale by quantity.
        unit_cost = self.get_material_cost(item["material"])
        base_cost = unit_cost * item["quantity"]
        labor = self.calculate_labor(item["quantity"], item["unit"])
        net = round(base_cost + labor, 2)
        # "multiplier" is 1 + the VAT rate (1.20 for the standard 20% rate).
        gross = round(net * self.tax_rules["multiplier"], 2)
        return {
            "line": item,
            "net_amount": net,
            "tax_amount": round(gross - net, 2),
            "gross_amount": gross
        }
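For money, floats accumulate rounding noise, so a runnable variant of the pricing step is sketched below with `decimal.Decimal`. Every figure in it (the unit cost, the labor rate, the lookup tables themselves) is a made-up placeholder for illustration. Only the 20% standard French VAT rate is a real value.

```python
from decimal import ROUND_HALF_UP, Decimal

CENT = Decimal("0.01")

# Hypothetical figures for illustration only, not real market rates.
UNIT_COSTS = {"concrete": Decimal("95.00")}  # EUR per m3 of material
LABOR_PER_UNIT = {"m3": Decimal("40.00")}    # EUR of labor per m3
VAT_RATE = Decimal("0.20")                   # standard French VAT rate


def price_item(item):
    """Return net, tax and gross amounts for one line item, rounded to cents."""
    quantity = Decimal(str(item["quantity"]))  # str() avoids float artifacts
    per_unit = UNIT_COSTS[item["material"]] + LABOR_PER_UNIT[item["unit"]]
    net = (per_unit * quantity).quantize(CENT, rounding=ROUND_HALF_UP)
    tax = (net * VAT_RATE).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"net_amount": net, "tax_amount": tax, "gross_amount": net + tax}


print(price_item({"material": "concrete", "quantity": 2.5, "unit": "m3"}))
```

Computing the tax from the already-rounded net amount keeps net + tax equal to gross on the printed document.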

4. Output: Factur-X Compliance

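In France, the e-invoicing reform that phases in from September 2026 requires structured electronic invoices for domestic B2B transactions. Factur-X (PDF/A-3 with embedded XML) is one of the accepted formats, alongside UBL and CII. For the businesses in scope, this isn't optional.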

Generating Factur-X from JSON is straightforward but fussy:

from lxml import etree

def generate_factur_x(estimate):
    """Build a simplified invoice XML and embed it in a PDF/A-3.

    The element names below are illustrative only. Real Factur-X XML
    follows the UN/CEFACT Cross Industry Invoice (CII) schema.
    """
    root = etree.Element("Invoice")
    root.set("version", "D3")

    # Add header
    header = etree.SubElement(root, "InvoiceHeader")
    etree.SubElement(header, "InvoiceNumber").text = str(estimate["id"])
    etree.SubElement(header, "InvoiceIssueDate").text = str(estimate["date"])

    # Add line items
    lines = etree.SubElement(root, "InvoiceLines")
    for item in estimate["items"]:
        line = etree.SubElement(lines, "InvoiceLine")
        etree.SubElement(line, "LineDescription").text = item["material"]
        etree.SubElement(line, "LineQuantity").text = str(item["quantity"])
        etree.SubElement(line, "LineAmount").text = str(item["gross_amount"])

    # tostring() returns bytes when an encoding is given
    xml_bytes = etree.tostring(
        root, pretty_print=True, xml_declaration=True, encoding="utf-8"
    )

    # Embed in PDF/A-3; pdf_with_embedded_xml is a helper you supply
    return pdf_with_embedded_xml(estimate, xml_bytes)

Libraries such as factur-x (Akretion's Python package) and pypdf (the maintained successor to PyPDF2) handle the heavy lifting.
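For orientation, here is roughly what the start of a real Factur-X payload looks like. Factur-X XML is based on the UN/CEFACT Cross Industry Invoice (CII) schema. This sketch builds only the document header with the standard library, is deliberately incomplete (no seller, buyer, lines, or totals), and will not pass schema validation as-is. Treat the element names and the MINIMUM profile identifier as a best-effort reading of the specification and validate against the official XSD before relying on them. The function name `build_cii_header` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rsm": "urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100",
    "ram": "urn:un:unece:uncefact:data:standard:ReusableAggregateBusinessInformationEntity:100",
    "udt": "urn:un:unece:uncefact:data:standard:UnqualifiedDataType:100",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return f"{{{NS[prefix]}}}{tag}"


def build_cii_header(number, issue_date):
    """Return header-only CII XML as bytes; issue_date is YYYYMMDD (format 102)."""
    root = ET.Element(q("rsm", "CrossIndustryInvoice"))

    context = ET.SubElement(root, q("rsm", "ExchangedDocumentContext"))
    guideline = ET.SubElement(
        context, q("ram", "GuidelineSpecifiedDocumentContextParameter")
    )
    ET.SubElement(guideline, q("ram", "ID")).text = "urn:factur-x.eu:1p0:minimum"

    document = ET.SubElement(root, q("rsm", "ExchangedDocument"))
    ET.SubElement(document, q("ram", "ID")).text = number
    ET.SubElement(document, q("ram", "TypeCode")).text = "380"  # commercial invoice
    issued = ET.SubElement(document, q("ram", "IssueDateTime"))
    date_el = ET.SubElement(issued, q("udt", "DateTimeString"))
    date_el.set("format", "102")
    date_el.text = issue_date

    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


print(build_cii_header("EST-001", "20260115").decode("utf-8"))
```

In practice a dedicated library should generate and validate the full document. The point here is only that the payload is namespaced CII, not an ad hoc `Invoice` tree.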

Real-World Gotchas

1. Network Latency on Site

A jobsite manager using voice on a 4G connection can't afford a 5-second round-trip to a distant LLM API. Deploy local inference (Whisper + Mistral on a laptop or edge device) for sub-500ms latency, and fall back to the cloud only for complex extraction.
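A local-first strategy with a cloud fallback could be sketched like this. The extractor callables, the result shape (a dict with a `confidence` key), and the default thresholds are assumptions for illustration, not a description of any specific deployment.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_with_fallback(
    transcript,
    local_extract,
    cloud_extract,
    min_confidence=0.7,
    cloud_timeout_s=5.0,
):
    """Try the on-device model first; call the cloud only when it is unsure."""
    result = local_extract(transcript)
    if result.get("confidence", 0.0) >= min_confidence:
        return {**result, "source": "local"}

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(cloud_extract, transcript)
    try:
        cloud = future.result(timeout=cloud_timeout_s)
        return {**cloud, "source": "cloud"}
    except Exception:
        # Timeout or network failure: keep the local answer, flag it for review.
        return {**result, "source": "local", "needs_review": True}
    finally:
        pool.shutdown(wait=False)  # do not block the caller on a slow request
```

The caller always gets an answer within the timeout. A dropped connection degrades the result to "needs review" instead of failing the whole estimate.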

2. Accent & Background Noise

French accents vary wildly. A Breton foreman and a Parisian engineer pronounce "crépi" (render) differently. Whisper handles this reasonably, but:

Collect domain-specific training data (record 50-100 real jobsite clips)
Fine-tune Whisper on your dialect mix
Filter noise and non-speech audio (Silero VAD for voice activity detection, ffmpeg's denoising filters)
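As one possible preprocessing step, the clip can be cleaned with ffmpeg before transcription. The sketch below only builds the command line. The filter chain (a high-pass filter followed by the `afftdn` FFT denoiser) and the 100 Hz cutoff are assumptions to tune against real recordings, and the function name is hypothetical.

```python
import shlex


def build_denoise_command(src, dst, highpass_hz=100):
    """Return an ffmpeg argv that denoises src and writes 16 kHz mono audio to dst."""
    filters = f"highpass=f={highpass_hz},afftdn"
    return [
        "ffmpeg", "-y",   # overwrite dst if it exists
        "-i", src,
        "-af", filters,   # cut low-frequency rumble, then FFT-based denoise
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz, the sample rate Whisper works at
        dst,
    ]


cmd = build_denoise_command("clip.wav", "clip_clean.wav")
print(shlex.join(cmd))
```

Run it with `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Passing a list rather than a shell string avoids quoting problems with file names.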

3. Vague Input

A manager says: "I need concrete." Which type? Reinforced C30/37 or plain lean concrete? How much? A production system must either:

Ask clarifying questions (bot: "You mentioned concrete. Is this reinforced, and how many cubic meters?")
Default to premium (overestimate, then let the client downgrade)

Anodos handles this with a two-phase voice flow: capture, then clarification, before producing the final estimate.
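The clarification phase can start from something as simple as checking which fields are missing. This is a sketch, not Anodos's implementation: the question wording, the `QUESTIONS` table, and the function name are invented for illustration.

```python
# One question template per field that pricing cannot work without.
QUESTIONS = {
    "material": "Which material is this for?",
    "quantity": "How much {material} do you need?",
    "unit": "Which unit is that in (m, m2, m3, tonnes, pieces, hours)?",
    "quality_tier": "Which grade of {material}: economy, standard, or premium?",
}


def clarification_questions(item):
    """Return one follow-up question per missing field, in a stable order."""
    material = item.get("material") or "material"
    return [
        template.format(material=material)
        for field, template in QUESTIONS.items()
        if item.get(field) in (None, "")
    ]


# "I need concrete." leaves quantity, unit and grade open.
for question in clarification_questions({"material": "concrete"}):
    print(question)
```

An empty list means the item is complete and can go straight to pricing.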

4. Regulatory Compliance

French BTP estimates are legally binding (Code de la consommation). An underestimate can trigger liability. Always:

Log every voice input + transcription (audit trail)
Flag estimates with confidence < 0.9 for review
Version your pricing rules (so you can trace which rates applied on a given date, e.g. 2026-01-15)
Require digital signature (eIDAS-compliant)
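The first three points in the checklist can be combined into one audit record per estimate. The sketch below is one way to do it, with assumptions throughout: the field names, the hash chaining via `prev_hash`, and the function name `audit_record` are illustrative, and the estimate must be JSON-serializable. It does not cover the eIDAS signature.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(audio, transcript, estimate, rules_version, prev_hash=""):
    """Build a tamper-evident log entry linking audio, transcript and estimate."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "audio_sha256": hashlib.sha256(audio).hexdigest(),
        "transcript": transcript,
        "estimate": estimate,
        "rules_version": rules_version,  # which pricing rules were applied
        "needs_review": estimate.get("confidence", 0.0) < 0.9,
        "prev_hash": prev_hash,          # chains this entry to the previous one
    }
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
    record["hash"] = hashlib.sha256(payload).hexdigest()
    return record


first = audit_record(b"audio-1", "2 tonnes de béton", {"confidence": 0.95}, "2026-01-15")
second = audit_record(
    b"audio-2", "10 m2 de chape", {"confidence": 0.8}, "2026-01-15",
    prev_hash=first["hash"],
)
```

Because each hash covers the previous entry's hash, editing an old record invalidates every record after it, which is what makes the log useful as evidence.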

Performance Metrics That Matter

In production, track:

STT accuracy: % of words matching a human reference transcript
Entity extraction F1: harmonic mean of precision and recall on line items
Confidence distribution: % of auto-approved estimates
Turnaround time: voice input → PDF in hand (target: < 10 seconds)
Cost per estimate: Whisper (~$0.01) + LLM (~$0.002) + infrastructure = ~$0.015-0.03 per voice estimate

At scale (1000 estimates/day), this beats manual typing by 10x.
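Two of the metrics above are easy to compute yourself. STT accuracy is usually reported through word error rate (WER), the word-level edit distance divided by the reference length, and F1 follows from counts of correct, spurious, and missed line items. A minimal sketch, with function names chosen for illustration:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when nothing was correct."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


print(word_error_rate("deux tonnes de béton armé", "deux tonnes de béton armer"))
```

The example has one wrong word out of five, so it prints 0.2. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.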

Lessons Learned

Whisper is good, not magic. Invest in domain-specific cleaning.
Don't over-automate. Flag uncertain extractions and let humans decide.
Respect regional rules. French BTP ≠ Spanish construction ≠ UK frameworks.
Build audit trails. You're generating legal documents; compliance comes first.
Local inference wins. Jobsite connectivity is unreliable; deploy edge models.

The future of construction estimating is voice-first. It's not about replacing estimators—it's about freeing them from data-entry drudgery so they can focus on judgment calls, site risk, and client relationships.

Olivier Ebrahim, founder of Anodos, is building voice AI infrastructure for French construction SMBs. Across more than 50 jobsites, voice-first estimating has cut paperwork by 80% and brought turnaround down to same-day quotes.

DEV Community