We measured 72% → 91% accuracy on Text-to-SQL over a 600-table ERP - what actually fixed it
We deployed a Text-to-SQL agent on a legacy ERP with 612 tables, ~8,400 columns, and inconsistent naming across modules. Initial offline eval looked acceptable. Production said otherwise.
Baseline
Model: GPT-4 class via API
Context: full schema dump + user query
Prompt: standard "generate SQL from question"
Eval set: 220 business questions from analysts
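For reference, a minimal sketch of this baseline, assuming an OpenAI-style chat API (model name and prompt wording here are illustrative, not our exact setup):

```python
# Baseline sketch: dump the full schema into context, generate SQL in one shot.
from openai import OpenAI

client = OpenAI()

def baseline_sql(question: str, full_schema_ddl: str) -> str:
    # At this stage, DDL for all ~612 tables went into the system prompt
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for "GPT-4 class via API"
        messages=[
            {"role": "system",
             "content": "Generate SQL for the user's question using this schema:\n"
                        + full_schema_ddl},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```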
Results:
- Exact match accuracy: 72%
- Execution success: 81%
- Latency: ~4.8s avg
- Production failure rate (week 1): 31% required human correction
What failed
Schema overload
- 600+ tables in context diluted attention
- Wrong joins between similarly named tables like orders vs order_hdr vs order_archive

Semantic mismatch
- Business language did not match schema
- Example: "revenue" mapped to total_amount instead of net_sales

Join path ambiguity
- Multiple valid joins existed
- Model picked shortest path, not correct one

Silent logical errors
- Queries executed but returned wrong numbers
- Hardest to detect
What moved accuracy to 91%
This was not fixed with prompting alone. We changed the system.
- Schema retrieval layer
Instead of passing the full schema, we:
- Embedded table names, column names, and sample values
- Retrieved the top 8–15 relevant tables per query (minimal sketch after this item)
Impact:
- Accuracy: 72% → 84%
- Latency reduced by ~1.2s
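A minimal sketch of the retrieval layer, assuming sentence-transformers as the embedder (the actual embedding model and k are deployment-specific):

```python
# Schema retrieval sketch: embed per-table descriptions, retrieve top-k per query.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder

def build_schema_index(tables):
    # tables: [{"name": ..., "columns": [...], "samples": [...]}, ...]
    docs = [
        f"table {t['name']} | columns: {', '.join(t['columns'])} "
        f"| sample values: {', '.join(map(str, t['samples']))}"
        for t in tables
    ]
    return model.encode(docs, convert_to_tensor=True)

def retrieve_tables(question, tables, doc_embeddings, k=12):
    # k in the 8-15 range; only these tables' DDL goes into the prompt
    q = model.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q, doc_embeddings, top_k=k)[0]
    return [tables[h["corpus_id"]] for h in hits]
```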
- Join graph constraints
Built a join graph from DB metadata + query logs
Forced the model to use only valid relationships (sketch after this item)
Impact:
- Accuracy: 84% → 88%
- ~60% reduction in incorrect joins
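A sketch of the constraint, assuming foreign-key metadata is available as (table, column, ref_table, ref_column) tuples; the graph library and extraction details are illustrative:

```python
# Join-graph sketch: only joins that exist as graph edges may appear in SQL.
import networkx as nx

def build_join_graph(foreign_keys):
    # foreign_keys: iterable of (table, column, ref_table, ref_column),
    # pulled from DB metadata and augmented with joins seen in query logs
    g = nx.Graph()
    for table, col, ref_table, ref_col in foreign_keys:
        g.add_edge(table, ref_table, on=(col, ref_col))
    return g

def allowed_join_conditions(g, retrieved_tables):
    # Restrict to the retrieved tables; the prompt states these are the
    # ONLY legal join conditions, which kills invented relationships
    conds = []
    for a, b, data in g.subgraph(retrieved_tables).edges(data=True):
        col, ref_col = data["on"]
        conds.append(f"{a}.{col} = {b}.{ref_col}")
    return conds
```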
- Business term mapping
Added translation layer:
- revenue → net_sales
- customer → client_id
- orders → sales_order_hdr
A hybrid approach (rules + embeddings) worked best (sketch after this item)
Impact:
- Accuracy: 88% → 90%
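A sketch of the hybrid mapping; the rules mirror the examples above, while the embedding fallback and threshold are illustrative:

```python
# Hybrid term mapping: exact rules first, embedding nearest-neighbor fallback.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

RULES = {  # curated, highest-precision mappings
    "revenue": "net_sales",
    "customer": "client_id",
    "orders": "sales_order_hdr",
}

def map_term(term, column_names, column_embeddings, threshold=0.6):
    if term.lower() in RULES:
        return RULES[term.lower()]
    # Fallback: nearest column by embedding similarity
    hit = util.semantic_search(
        model.encode(term, convert_to_tensor=True), column_embeddings, top_k=1
    )[0][0]
    if hit["score"] >= threshold:
        return column_names[hit["corpus_id"]]
    return term  # no confident mapping; pass through unchanged
```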
- SQL self-check and retry
Second pass (sketch after this item):
- Validate generated SQL against schema and intent
- Allow one retry
Impact:
- Accuracy: 90% → 91%
- Execution success: 81% → 93%
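A sketch of the schema-validity half of the self-check, using sqlglot to parse the generated SQL (the intent validation is a separate step, not shown):

```python
# Self-check sketch: parse the SQL, verify referenced tables exist, retry once.
import sqlglot
from sqlglot import exp

def referenced_tables(sql: str) -> set:
    return {t.name for t in sqlglot.parse_one(sql).find_all(exp.Table)}

def generate_with_retry(question, schema_tables, generate_sql):
    # generate_sql is a placeholder for the LLM call from the earlier steps
    sql = generate_sql(question)
    unknown = referenced_tables(sql) - schema_tables
    if unknown:
        # One retry only, with the validation failure fed back to the model
        sql = generate_sql(
            f"{question}\nYour previous SQL referenced unknown tables: "
            f"{sorted(unknown)}. Use only tables from the provided schema."
        )
    return sql
```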
- Evaluation change
Switched to:
- Exact match OR result equivalence (equivalence check sketched after this item)
- Added real production edge cases
Measured accuracy initially dropped, then stabilized; the new eval carried real signal
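A sketch of result-equivalence grading; the normalization details (stringified, order-insensitive rows) are an assumption about how results get compared:

```python
# Result-equivalence grading: queries match if they return the same rows,
# regardless of row order or SQL text. execute(sql) -> list of row tuples.
def results_equivalent(rows_a, rows_b) -> bool:
    normalize = lambda rows: sorted(tuple(map(str, r)) for r in rows)
    return normalize(rows_a) == normalize(rows_b)

def grade(pred_sql, gold_sql, execute) -> bool:
    if pred_sql.strip().lower() == gold_sql.strip().lower():
        return True  # exact match
    try:
        return results_equivalent(execute(pred_sql), execute(gold_sql))
    except Exception:
        return False  # prediction failed to execute
```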
What did not work
- Larger models: <2% gain
- Longer prompts: worse performance
- Few-shot examples: inconsistent results
- Free schema exploration: more hallucinated joins
Final production metrics
- Accuracy: 91%
- Execution success: 93%
- Avg latency: 3.6s
- Human intervention: 31% → 9%
Still unsolved
- Complex financial aggregations
- Time-based comparisons across inconsistent date fields
- Implicit business rules not present in DB
Bottom line
Text-to-SQL at enterprise scale is not a prompt problem.
It is:
- retrieval
- constraints
- evaluation
Without controlling schema exposure and join paths, accuracy plateaus early.