Zeeshan Ghazanfar

We measured 72% → 91% accuracy on Text-to-SQL over a 600-table ERP - what actually fixed it

We deployed a Text-to-SQL agent on a legacy ERP with 612 tables, ~8,400 columns, and inconsistent naming across modules. Initial offline eval looked acceptable. Production said otherwise.

Baseline

  • Model: GPT-4 class via API
  • Context: full schema dump + user query
  • Prompt: standard "generate SQL from question"
  • Eval set: 220 business questions from analysts

Results:

  • Exact match accuracy: 72%
  • Execution success: 81%
  • Latency: ~4.8s avg
  • Production failure rate (week 1): 31% required human correction

What failed

  1. Schema overload
    600+ tables in context diluted attention
    Wrong joins between similarly named tables like orders vs order_hdr vs order_archive

  2. Semantic mismatch
    Business language did not match schema
    Example: "revenue" mapped to total_amount instead of net_sales

  3. Join path ambiguity
    Multiple valid joins existed
    Model picked shortest path, not correct one

  4. Silent logical errors
    Queries executed but returned wrong numbers
    Hardest to detect

What moved accuracy to 91%

This was not fixed with prompting alone. We changed the system.

  1. Schema retrieval layer

Instead of passing the full schema:

  • Embedded table names, column names, sample values
  • Retrieved top 8–15 relevant tables per query

Impact:

  • Accuracy: 72% → 84%
  • Latency reduced by ~1.2s
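The retrieval layer can be sketched as below. This is a minimal illustration, not our production code: a toy bag-of-words vector stands in for the real embedding model, and the table names and sample values are invented for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().replace("_", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each table is indexed by its name, column names, and a few sample values.
SCHEMA_INDEX = {
    "sales_order_hdr": "sales order hdr order_date client_id net_sales",
    "order_archive":   "order archive archived_at legacy_status",
    "inventory_item":  "inventory item sku qty_on_hand warehouse",
}

def retrieve_tables(question: str, k: int = 2) -> list:
    # Rank tables by similarity to the question; keep only the top k.
    q = embed(question)
    ranked = sorted(SCHEMA_INDEX,
                    key=lambda t: cosine(q, embed(SCHEMA_INDEX[t])),
                    reverse=True)
    return ranked[:k]
```

Only the retrieved 8–15 tables (2 in this toy index) go into the prompt, which is where both the accuracy gain and the latency drop came from.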

  2. Join graph constraints

Built a join graph from DB metadata + query logs
Constrained the model to use only valid relationships

Impact:

  • Accuracy: 84% → 88%
  • ~60% reduction in incorrect joins
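A sketch of the constraint check, assuming the join graph is a set of allowed table pairs mined from foreign keys and query logs (the edges below are illustrative):

```python
# Hypothetical join graph: each edge names the columns a join may use.
JOIN_GRAPH = {
    ("sales_order_hdr", "sales_order_line"): ("order_id", "order_id"),
    ("sales_order_hdr", "client"):           ("client_id", "client_id"),
}

def allowed_edge(t1: str, t2: str) -> bool:
    # Joins are undirected: check both orientations.
    return (t1, t2) in JOIN_GRAPH or (t2, t1) in JOIN_GRAPH

def invalid_joins(joins: list) -> list:
    """Return every join the graph does not allow (empty means the SQL passes)."""
    return [j for j in joins if not allowed_edge(*j)]
```

Generated SQL whose joins fall outside the graph is rejected before execution, which is what cut the incorrect-join rate.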

  3. Business term mapping

Added translation layer:

  • revenue → net_sales
  • customer → client_id
  • orders → sales_order_hdr

A hybrid approach (rules + embeddings) worked best

Impact:

  • Accuracy: 88% → 90%
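The hybrid idea, sketched minimally: curated rules win when they match, and a similarity fallback handles everything else. Here `difflib` string similarity stands in for the embedding step, and the glossary entries are illustrative.

```python
import difflib

# Curated rules take priority; these mappings mirror the examples above.
TERM_RULES = {
    "revenue": "net_sales",
    "customer": "client_id",
    "orders": "sales_order_hdr",
}
SCHEMA_TERMS = ["net_sales", "client_id", "sales_order_hdr", "qty_on_hand"]

def map_term(term: str):
    t = term.lower()
    if t in TERM_RULES:                  # 1) deterministic rule hit
        return TERM_RULES[t]
    # 2) fuzzy fallback (stand-in for embedding similarity over schema terms)
    close = difflib.get_close_matches(t, SCHEMA_TERMS, n=1, cutoff=0.6)
    return close[0] if close else None
```

Rules give deterministic behavior for the terms analysts use constantly; the fallback catches near-misses without a rule for every variant.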

  4. SQL self-check and retry

Second pass:

  • Validate generated SQL against schema and intent
  • Allow one retry

Impact:

  • Accuracy: 90% → 91%
  • Execution success: 81% → 93%

  5. Evaluation change

Switched to:

  • exact match OR result equivalence
  • added real production edge cases

Measured accuracy initially dropped, then stabilized once the eval reflected real production signal
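The scoring rule itself is simple; a sketch, where `run` is a hypothetical callback that executes SQL and returns rows:

```python
def results_equivalent(a: list, b: list) -> bool:
    # Order-insensitive, duplicate-preserving comparison of result rows.
    return sorted(a) == sorted(b)

def score(pred_sql: str, gold_sql: str, run) -> bool:
    """Credit an exact string match OR matching execution results."""
    if pred_sql.strip() == gold_sql.strip():
        return True
    return results_equivalent(run(pred_sql), run(gold_sql))
```

Result equivalence is what lets syntactically different but semantically correct queries count as passes, which exact match alone penalized.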

What did not work

  • Larger models: <2% gain
  • Longer prompts: worse performance
  • Few-shot examples: inconsistent results
  • Free schema exploration: more hallucinated joins

Final production metrics

  • Accuracy: 91%
  • Execution success: 93%
  • Avg latency: 3.6s
  • Human intervention: 31% → 9%

Still unsolved

  • Complex financial aggregations
  • Time-based comparisons across inconsistent date fields
  • Implicit business rules not present in DB

Bottom line

Text-to-SQL at enterprise scale is not a prompt problem.

It is:

  • retrieval
  • constraints
  • evaluation

Without controlling schema exposure and join paths, accuracy plateaus early.
