We measured 72% → 91% accuracy on Text-to-SQL over a 600-table ERP - what actually fixed it
We deployed a Text-to-SQL agent on a legacy ERP with 612 tables, ~8,400 columns, and inconsistent naming across modules. Initial offline eval looked acceptable. Production said otherwise.
Baseline
Model: GPT-4 class via API
Context: full schema dump + user query
Prompt: standard "generate SQL from question"
Eval set: 220 business questions from analysts
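For reference, a minimal sketch of this baseline, assuming an OpenAI-style chat API (model name and prompt wording here are illustrative, not our exact setup):

```python
# Baseline sketch: dump the full schema into context, generate SQL in one shot.
from openai import OpenAI

client = OpenAI()

def baseline_sql(question: str, full_schema_ddl: str) -> str:
    # At this stage, DDL for all ~612 tables went into the system prompt
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for "GPT-4 class via API"
        messages=[
            {"role": "system",
             "content": "Generate SQL for the user's question using this schema:\n"
                        + full_schema_ddl},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```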
Results:
- Exact match accuracy: 72%
- Execution success: 81%
- Latency: ~4.8s avg
- Production failure rate (week 1): 31% required human correction
What failed
Schema overload
- 600+ tables in context diluted attention
- Wrong joins between similarly named tables like orders vs order_hdr vs order_archive

Semantic mismatch
- Business language did not match schema
- Example: "revenue" mapped to total_amount instead of net_sales

Join path ambiguity
- Multiple valid joins existed
- Model picked shortest path, not correct one

Silent logical errors
- Queries executed but returned wrong numbers
- Hardest to detect
What moved accuracy to 91%
This was not fixed with prompting alone. We changed the system.
- Schema retrieval layer
Instead of passing the full schema, we:
- Embedded table names, column names, and sample values
- Retrieved the top 8–15 relevant tables per query (minimal sketch after this item)
Impact:
- Accuracy: 72% → 84%
- Latency reduced by ~1.2s
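A minimal sketch of the retrieval layer, assuming sentence-transformers as the embedder (the actual embedding model and k are deployment-specific):

```python
# Schema retrieval sketch: embed per-table descriptions, retrieve top-k per query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder

def build_schema_index(tables):
    # tables: [{"name": ..., "columns": [...], "samples": [...]}, ...]
    docs = [
        f"table {t['name']} | columns: {', '.join(t['columns'])} "
        f"| sample values: {', '.join(map(str, t['samples']))}"
        for t in tables
    ]
    return model.encode(docs, convert_to_tensor=True)

def retrieve_tables(question, tables, doc_embeddings, k=12):
    # k in the 8-15 range; only these tables' DDL goes into the prompt
    q = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q, doc_embeddings, top_k=k)[0]
    return [tables[h["corpus_id"]] for h in hits]
```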
- Join graph constraints
Built a join graph from DB metadata + query logs
Forced the model to use only valid relationships (sketch after this item)
Impact:
- Accuracy: 84% → 88%
- ~60% reduction in incorrect joins
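A sketch of the constraint, assuming foreign-key metadata is available as (table, column, ref_table, ref_column) tuples; the graph library and extraction details are illustrative:

```python
# Join-graph sketch: only joins that exist as graph edges may appear in SQL.
import networkx as nx

def build_join_graph(foreign_keys):
    # foreign_keys: iterable of (table, column, ref_table, ref_column),
    # pulled from DB metadata and augmented with joins seen in query logs
    g = nx.Graph()
    for table, col, ref_table, ref_col in foreign_keys:
        g.add_edge(table, ref_table, on=(col, ref_col))
    return g

def allowed_join_conditions(g, retrieved_tables):
    # Restrict to the retrieved tables; the prompt states these are the
    # ONLY legal join conditions, which kills invented relationships
    conds = []
    for a, b, data in g.subgraph(retrieved_tables).edges(data=True):
        col, ref_col = data["on"]
        conds.append(f"{a}.{col} = {b}.{ref_col}")
    return conds
```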
- Business term mapping
Added translation layer:
- revenue → net_sales
- customer → client_id
- orders → sales_order_hdr
A hybrid approach (rules + embeddings) worked best (sketch after this item)
Impact:
- Accuracy: 88% → 90%
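A sketch of the hybrid mapping; the rules mirror the examples above, while the embedding fallback and threshold are illustrative:

```python
# Hybrid term mapping: exact rules first, embedding nearest-neighbor fallback.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

RULES = {  # curated, highest-precision mappings
    "revenue": "net_sales",
    "customer": "client_id",
    "orders": "sales_order_hdr",
}

def map_term(term, column_names, column_embeddings, threshold=0.6):
    if term.lower() in RULES:
        return RULES[term.lower()]
    # Fallback: nearest column by embedding similarity
    hit = util.semantic_search(
        model.encode(term, convert_to_tensor=True), column_embeddings, top_k=1
    )[0][0]
    if hit["score"] >= threshold:
        return column_names[hit["corpus_id"]]
    return term  # no confident mapping; pass through unchanged
```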
- SQL self-check and retry
Second pass (sketch after this item):
- Validate generated SQL against schema and intent
- Allow one retry
Impact:
- Accuracy: 90% → 91%
- Execution success: 81% → 93%
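A sketch of the schema-validity half of the self-check, using sqlglot to parse the generated SQL (the intent validation is a separate step, not shown):

```python
# Self-check sketch: parse the SQL, verify referenced tables exist, retry once.
import sqlglot
from sqlglot import exp

def referenced_tables(sql: str) -> set:
    return {t.name for t in sqlglot.parse_one(sql).find_all(exp.Table)}

def generate_with_retry(question, schema_tables, generate_sql):
    # generate_sql is a placeholder for the LLM call from the earlier steps
    sql = generate_sql(question)
    unknown = referenced_tables(sql) - schema_tables
    if unknown:
        # One retry only, with the validation failure fed back to the model
        sql = generate_sql(
            f"{question}\nYour previous SQL referenced unknown tables: "
            f"{sorted(unknown)}. Use only tables from the provided schema."
        )
    return sql
```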
- Evaluation change
Switched to:
- Exact match OR result equivalence (equivalence check sketched after this item)
- Added real production edge cases
Measured accuracy initially dropped, then stabilized; the new eval carried real signal
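A sketch of result-equivalence grading; the normalization details (stringified, order-insensitive rows) are an assumption about how results get compared:

```python
# Result-equivalence grading: queries match if they return the same rows,
# regardless of row order or SQL text. execute(sql) -> list of row tuples.
def results_equivalent(rows_a, rows_b) -> bool:
    normalize = lambda rows: sorted(tuple(map(str, r)) for r in rows)
    return normalize(rows_a) == normalize(rows_b)

def grade(pred_sql, gold_sql, execute) -> bool:
    if pred_sql.strip().lower() == gold_sql.strip().lower():
        return True  # exact match
    try:
        return results_equivalent(execute(pred_sql), execute(gold_sql))
    except Exception:
        return False  # prediction failed to execute
```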
What did not work
- Larger models: <2% gain
- Longer prompts: worse performance
- Few-shot examples: inconsistent results
- Free schema exploration: more hallucinated joins
Final production metrics
- Accuracy: 91%
- Execution success: 93%
- Avg latency: 3.6s
- Human intervention: 31% → 9%
Still unsolved
- Complex financial aggregations
- Time-based comparisons across inconsistent date fields
- Implicit business rules not present in DB
Bottom line
Text-to-SQL at enterprise scale is not a prompt problem.
It is:
- retrieval
- constraints
- evaluation
Without controlling schema exposure and join paths, accuracy plateaus early.