A research-backed proof that vanilla NLP-to-SQL completely fails on cryptic, real-world schemas.
## 🎯 Purpose
This project demonstrates why naively pointing an LLM at a database schema and asking it to generate SQL is fundamentally broken for enterprise use cases. The central argument:

> Without a semantic metadata layer (an ontology, knowledge graph, or schema annotations), an LLM is just guessing at business semantics, and it will guess wrong.

This aligns with published research showing that Text-to-SQL accuracy drops sharply as schema complexity increases (see the Spider and BIRD benchmarks).

Code for the article is here: https://github.com/vishalmysore/text-to-sql
## 🪤 The "Failure-by-Design" Enterprise Schema (11 Tables)
The schema intentionally spans 6 business domains with inconsistent naming conventions across each, so no domain's column names give hints about another domain's semantics.
| Domain | Tables | Key Traps |
|---|---|---|
| CRM | `CLIENTS`, `CONTACT_INFO` | `C_STS_CD=0/1/9` (opaque); `X_ENT_ID` → `C_ID` (no FK, different prefix) |
| Staff / Org | `STAFF` | `S_LVL=1-4` integers; self-referencing `S_MGR` with no FK |
| Products | `PRODUCTS`, `CLIENT_PRODUCT` | `PRD_RK` = risk rank 1-5 (no labels); `PRD_FEE_BP` in basis points (unit not in name) |
| Transactions | `TXN_LOG`, `TXN_AUDIT` | `T_TYPE=P/R/C/S`; `T_STS=A/P/R/C`; `C_REF` → `C_ID` (no FK); `TA_TXN_REF` → `T_ID` (different name) |
| Finance | `FIN_METRICS`, `BUDGET` | `VAL_A`=gross, `VAL_B`=net, `VAL_C`=costs (no context); `M_PRD='YYYYMM'` string vs `B_QTR='YYYYQN'` string |
| Compliance | `KYC_REC`, `ALERT_LOG` | `KYC_STS='R'` (Review) clashes with `T_STS='R'` (Rejected); `A_STS='C'` (Closed) clashes with `C_SEG='C'` (Corporate) |
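To make the traps concrete, here is a minimal runnable sketch of two of the trap tables, using SQLite in place of the project's H2 database. The column names and code values come from the table above; the sample rows, and the 0=prospect / 9=churned glosses, are illustrative assumptions, not the demo's actual dataset.

```python
import sqlite3

# SQLite stands in for the demo's H2 database; sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CLIENTS (
    C_ID     INTEGER PRIMARY KEY,
    C_NAME   TEXT,
    C_SEG    TEXT,     -- 'C'=Corporate: single char, no lookup table
    C_STS_CD INTEGER   -- opaque 0/1/9 code; 1=active (0/9 meanings assumed here)
);
CREATE TABLE TXN_LOG (
    T_ID  INTEGER PRIMARY KEY,
    C_REF INTEGER,     -- really CLIENTS.C_ID, but no FK declared and the name differs
    T_STS TEXT,        -- A/P/R/C ('R'=Rejected here, while 'R'=Review in KYC_REC)
    T_AMT REAL
);
""")
conn.executemany("INSERT INTO CLIENTS VALUES (?,?,?,?)",
                 [(1, "Acme Corp", "C", 1), (2, "John Doe", "R", 1), (3, "Old Co", "C", 9)])

# A naive model writes C_STS_CD = 'ACTIVE'; the real predicate is the bare integer 1.
active = conn.execute("SELECT COUNT(*) FROM CLIENTS WHERE C_STS_CD = 1").fetchone()[0]
print(active)  # 2 with these three sample rows
```

Against H2's strict typing, the string version of that predicate is what raises the INTEGER conversion errors seen in the live results below.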
## 🔥 The 10 Failure Scenarios
### Single/Two-Table Failures
| # | Query | Trap | Expected Correct Answer |
|---|---|---|---|
| 1 | "How many active clients do we have?" | `C_STS_CD` is INTEGER 1, not `'ACTIVE'` | 4 |
| 2 | "Show me the total financial volume for customer John Doe." | Join CLIENTS → TXN_LOG via `C_ID=C_REF` (no FK declared) | 1700.00 |
| 3 | "List all transactions for high-value clients." | "high-value" is undefined, so the LLM must invent a threshold | Acme Corp rows |
| 4 | "What was the gross revenue in January 2024?" | `M_PRD='202401'` is a VARCHAR, not a DATE; `VAL_A`=gross (zero context) | 850000 |
| 5 | "Show me all corporate clients and their transaction count." | `C_SEG='C'`, not `'Corporate'` | Acme Corp |
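Once the period encoding and column meanings are known, scenario 4 reduces to a plain string comparison. A sketch with SQLite standing in for H2; the first `FIN_METRICS` row mirrors the expected answer above, and the remaining values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# M_PRD is a 'YYYYMM' VARCHAR, not a DATE; VAL_A is gross revenue. Neither fact
# is recoverable from the names alone: this is what the semantic layer must supply.
conn.execute("CREATE TABLE FIN_METRICS (M_ID INTEGER, M_PRD TEXT, VAL_A REAL, VAL_B REAL, VAL_C REAL)")
conn.execute("INSERT INTO FIN_METRICS VALUES (1, '202401', 850000, 700000, 150000)")
conn.execute("INSERT INTO FIN_METRICS VALUES (2, '202402', 900000, 760000, 140000)")

# "Gross revenue in January 2024": string-match the period, sum the gross column.
gross_jan = conn.execute(
    "SELECT SUM(VAL_A) FROM FIN_METRICS WHERE M_PRD = '202401'"
).fetchone()[0]
print(gross_jan)  # 850000.0
```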
### Complex Multi-Table / Cross-Domain Failures
| # | Query | Trap | Expected Correct Answer |
|---|---|---|---|
| 6 | "Which clients have open AML alerts assigned to compliance staff?" | Polymorphic FK: `ALERT_LOG.A_ENT_ID` could be any entity; `S_LVL` is INTEGER, not `'COMPLIANCE'` | MegaFund LLC |
| 7 | "List active clients who hold derivatives products and have approved transactions." | 4-table join: CLIENTS → CLIENT_PRODUCT → PRODUCTS → TXN_LOG; all status codes are integers/chars | Acme Corp |
| 8 | "Who are the direct reports of the head of the Risk department?" | Self-referencing `STAFF.S_MGR`; `S_LVL=4` means MD/head (no label); wrong column aliases likely | Frank Osei |
| 9 | "Find clients whose KYC is under review and have pending transactions." | `KYC_STS='R'` means Review; `T_STS='R'` means Rejected: same letter, opposite semantics | Jane Smith |
| 10 | "Compare Finance dept planned budget vs actual revenue for Q1 2024." | `BUDGET.B_QTR='2024Q1'` vs `FIN_METRICS.M_PRD='202401'`: different period encodings in the same DB | 1050000 |
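Scenario 9 is the sharpest trap: the same letter means different things in different tables. A runnable sketch, with SQLite in place of H2 and illustrative client IDs and rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE KYC_REC (K_ID INTEGER, K_CLT_ID INTEGER, KYC_STS TEXT);  -- 'R' = under Review
CREATE TABLE TXN_LOG (T_ID INTEGER, C_REF INTEGER, T_STS TEXT);       -- 'R' = Rejected, 'P' = Pending
""")
conn.executemany("INSERT INTO KYC_REC VALUES (?,?,?)", [(1, 10, 'R'), (2, 11, 'A')])
conn.executemany("INSERT INTO TXN_LOG VALUES (?,?,?)", [(1, 10, 'P'), (2, 11, 'R')])

# Done right: the SAME letter 'R' filters KYC_REC, but must NOT filter TXN_LOG,
# where pending is 'P'. Only metadata outside the schema can tell the two apart.
rows = conn.execute("""
    SELECT K_CLT_ID FROM KYC_REC WHERE KYC_STS = 'R'
    INTERSECT
    SELECT C_REF FROM TXN_LOG WHERE T_STS = 'P'
""").fetchall()
print(rows)  # [(10,)]
```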
Execution Flow:

```
Natural Language Query
        │
        ▼
SqlGenerationEngine
  └─ buildPrompt(11-table schemaDesc + query)
        │
        ▼
Tools4AI → processor.query()
  Nvidia NEMO LLM
  └─ generates SQL (often hallucinated)
        │
        ▼
DatabaseManager.execute(sql)
  └─ H2 executes → result or SQLException
        │
        ▼
TestResult (SUCCESS / SQL_ERROR / EMPTY_RESULT / WRONG_ANSWER)
        │
        ▼
Console Report
```
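The final classification step of the flow can be sketched as below. `classify` is a hypothetical helper, not the project's actual `DatabaseManager`, and SQLite stands in for H2; note that SQLite coerces mismatched types instead of raising, so the hallucinated string predicate surfaces here as EMPTY_RESULT rather than H2's SQL_ERROR.

```python
import sqlite3

def classify(conn, sql, expected):
    """Run generated SQL and bucket the outcome the way the demo's report does."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return "SQL_ERROR"          # e.g. a nonexistent column, as in scenario 8
    if not rows or rows == [(None,)]:
        return "EMPTY_RESULT"       # query ran but matched nothing (wrong literal value)
    return "SUCCESS" if rows == expected else "WRONG_ANSWER"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CLIENTS (C_ID INTEGER, C_STS_CD INTEGER)")
conn.execute("INSERT INTO CLIENTS VALUES (1, 1)")

print(classify(conn, "SELECT C_ID FROM CLIENTS WHERE C_STS_CD = 'A'", [(1,)]))  # EMPTY_RESULT
print(classify(conn, "SELECT C_ID FROM CLIENTS WHERE C_STS_CD = 1",   [(1,)]))  # SUCCESS
```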
## 🚀 Running the Demo
### Prerequisites
- Java 18+
- Maven 3.8+
- Nvidia NEMO API key (pre-configured in `tools4ai.properties`)

### Run
```
cd c:\work\text2sql
mvn clean install
mvn exec:java "-Dexec.mainClass=io.github.vishalmysore.text2sql.Text2SqlFailureDemo"
```
## 📊 Live Test Results (Nvidia NEMO, nemotron-nano-12b-v2-vl)
Results captured against the 11-table enterprise schema using the live Nvidia NEMO API via Tools4AI.
```
────────────────────────── RUNNING SCENARIOS ──────────────────────────
▶ 1. Semantic Ambiguity
  Generated SQL: SELECT COUNT(*) FROM CLIENTS WHERE C_STS_CD = 'A';

▶ 2. Cryptic Column
  Generated SQL: SELECT SUM(T.T_AMT) AS TOTAL_FINANCIAL_VOLUME
                 FROM CLIENTS C JOIN TXN_LOG T ON C.C_ID = T.C_REF
                 WHERE C.C_NAME = 'John Doe';

▶ 3. Implicit Join Trap
  Generated SQL: SELECT * FROM TXN_LOG
                 WHERE C_REF IN (SELECT C_ID FROM CLIENTS WHERE C_STS_CD = 'HIGH');

▶ 4. Temporal Encoding
  Generated SQL: SELECT SUM(T_AMT) AS GROSS_REVENUE FROM TXN_LOG
                 WHERE T_STS = 'COMPLETED' AND EXTRACT(MONTH FROM T_TS) = 1
                 AND EXTRACT(YEAR FROM T_TS) = 2024;

▶ 5. Cross-Segment Bias
  Generated SQL: SELECT C.C_NAME, COUNT(T.T_ID) AS TRANSACTION_COUNT
                 FROM CLIENTS C JOIN TXN_LOG T ON C.C_ID = T.C_REF
                 WHERE C.C_SEG = 'CORPORATE' GROUP BY C.C_ID, C.C_NAME;

▶ 6. Polymorphic FK Trap
  Generated SQL: SELECT C.C_NAME FROM CLIENTS C
                 JOIN ALERT_LOG A ON C.C_ID = A.A_ENT_ID
                 JOIN STAFF S ON A.A_OWN_ID = S.S_ID
                 WHERE A.A_ENT_TYPE = 'CLIENT' AND A.A_CAT = 'AML'
                 AND A.A_STS = 'OPEN' AND S.S_LVL = 'COMPLIANCE';

▶ 7. Four-Table Join
  Generated SQL: SELECT DISTINCT c.C_ID, c.C_NAME FROM CLIENTS c
                 JOIN CLIENT_PRODUCT cp ON c.C_ID = cp.CP_CLT_ID
                 JOIN PRODUCTS p ON cp.CP_PRD_ID = p.P_ID
                 JOIN TXN_LOG t ON c.C_ID = t.C_REF AND p.P_ID = t.P_REF
                 WHERE c.C_STS_CD = 'ACTIVE' AND p.PRD_CAT = 'DERIVATIVES'
                 AND t.T_STS = 'APPROVED';

▶ 8. Self-Referencing Hierarchy
  Generated SQL: SELECT S.NM FROM STAFF S
                 WHERE S.DEPT_CD = 'RISK'
                 AND S.S_ID IN (SELECT S_MGR FROM STAFF WHERE S_DEPT_CD = 'RISK');

▶ 9. Ambiguous Status Code Collision
  Generated SQL: SELECT C_ID FROM CLIENTS
                 WHERE C_ID IN (SELECT K_CLT_ID FROM KYC_REC WHERE KYC_STS = 'UNDER REVIEW')
                 AND C_ID IN (SELECT C_REF FROM TXN_LOG WHERE T_STS = 'PENDING');

▶ 10. Budget vs Metrics Period Mismatch
  Generated SQL: SELECT B.B_DEPT, B.PLANNED, SUM(T.T_AMT) AS ACTUAL
                 FROM BUDGET B JOIN TXN_LOG T ON B.B_DEPT = T.T_DESK_CD
                 WHERE B.B_QTR = 'Q1 2024' GROUP BY B.B_DEPT, B.PLANNED;

────────────────────────── RESULTS SUMMARY ────────────────────────────
🔥 [SQL_ERROR] 1. Semantic Ambiguity
   LLM tried: C_STS_CD = 'A' → INTEGER conversion error
   Root cause: Status codes are opaque integers. LLM guessed string abbreviation.

✅ [SUCCESS] 2. Cryptic Column
   LLM tried: JOIN TXN_LOG ON C_ID = C_REF → Correct result (1700.00)
   Note: Got lucky: C_REF naming was close enough to C_ID to infer the join.

🔥 [SQL_ERROR] 3. Implicit Join Trap
   LLM tried: C_STS_CD = 'HIGH' → INTEGER conversion error
   Root cause: 'High-value' has no schema definition. LLM invented 'HIGH' status code.

❌ [WRONG_ANSWER] 4. Temporal Encoding
   LLM tried: TXN_LOG WHERE T_STS='COMPLETED', date filter on T_TS → null result
   Root cause: Queried wrong table (TXN_LOG, not FIN_METRICS); T_STS='COMPLETED' hallucinated.

⚠️ [EMPTY_RESULT] 5. Cross-Segment Bias
   LLM tried: C_SEG = 'CORPORATE' → 0 rows
   Root cause: Actual value is 'C'. LLM used full English word.

🔥 [SQL_ERROR] 6. Polymorphic FK Trap
   LLM tried: S_LVL = 'COMPLIANCE' → INTEGER conversion error
   Root cause: S_LVL is a 1-4 integer; LLM used dept name as level label.
   Also: A_ENT_TYPE = 'CLIENT' should be 'C'; A_STS = 'OPEN' should be 'O'.

🔥 [SQL_ERROR] 7. Four-Table Join
   LLM tried: C_STS_CD='ACTIVE', PRD_CAT='DERIVATIVES', T_STS='APPROVED' → 3 string errors
   Root cause: Three opaque codes hallucinated as English words simultaneously.

🔥 [SQL_ERROR] 8. Self-Referencing Hierarchy
   LLM tried: SELECT S.NM, S.DEPT_CD → Column not found
   Root cause: Column is S_NM, not NM; S_DEPT_CD, not DEPT_CD. Alias truncation error.

⚠️ [EMPTY_RESULT] 9. Ambiguous Status Code Collision
   LLM tried: KYC_STS = 'UNDER REVIEW' → 0 rows
   Root cause: Actual value is 'R'. Same 'R' means Rejected in TXN_LOG → LLM cannot distinguish.

⚠️ [EMPTY_RESULT] 10. Budget vs Metrics Period Mismatch
   LLM tried: B_QTR='Q1 2024', JOIN BUDGET→TXN_LOG on DEPT=DESK → 0 rows
   Root cause: Period format wrong ('Q1 2024' ≠ '2024Q1'). Joined wrong tables entirely.

─────────────────────────────────────────────────────────────
Total Scenarios : 10
✅ Successful    : 1 (10%)
🔥 SQL Errors    : 5 (50%)
⚠️ Empty Result  : 3 (30%)
❌ Wrong Answer  : 1 (10%)
─────────────────────────────────────────────────────────────
CONCLUSION: Without a semantic metadata layer, vanilla Text-to-SQL
fails on cryptic real-world schemas. The LLM is forced to GUESS
column semantics, join keys, and business thresholds. This demo
proves that a KG / ontology layer (e.g. HyperRAG) is necessary
to bridge the gap between natural language and opaque databases.
```
## Live Results: 10 Scenarios, 11 Tables, 1 Correct Answer (10% accuracy)
| # | Scenario | Generated SQL (key error) | Outcome | Why |
|---|---|---|---|---|
| 1 | Semantic Ambiguity | `C_STS_CD = 'A'` | 🔥 SQL ERROR | `C_STS_CD` is INTEGER; tried string `'A'` |
| 2 | Cryptic Column | `JOIN TXN_LOG ON C_ID = C_REF WHERE C_NAME='John Doe'` | ✅ SUCCESS | Got lucky: found the `C_REF` join |
| 3 | Implicit Join Trap | `WHERE C_STS_CD = 'HIGH'` | 🔥 SQL ERROR | Invented string `'HIGH'` for an integer column |
| 4 | Temporal Encoding | `T_STS = 'COMPLETED'` on TXN_LOG | ❌ WRONG ANSWER | Used wrong table entirely; null result |
| 5 | Cross-Segment Bias | `C_SEG = 'CORPORATE'` | ⚠️ EMPTY | Correct logic, wrong code: `'C'` ≠ `'CORPORATE'` |
| 6 | Polymorphic FK Trap | `A_STS = 'OPEN'`, `S_LVL = 'COMPLIANCE'` | 🔥 SQL ERROR | `S_LVL` is INTEGER; string conversion error |
| 7 | Four-Table Join | `C_STS_CD = 'ACTIVE'`, `PRD_CAT = 'DERIVATIVES'` | 🔥 SQL ERROR | Two hallucinated string values in one query |
| 8 | Self-Referencing Hierarchy | `SELECT S.NM` (wrong alias), `S.DEPT_CD` | 🔥 SQL ERROR | Column `S.NM` doesn't exist; it's `S_NM` |
| 9 | Status Code Collision | `KYC_STS = 'UNDER REVIEW'` | ⚠️ EMPTY | Real value is `'R'`, not a description |
| 10 | Period Format Mismatch | `B.B_QTR = 'Q1 2024'` joins TXN_LOG on `B_DEPT = T_DESK_CD` | ⚠️ EMPTY | Period format wrong (`'2024Q1'`); joined wrong tables |
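For contrast, scenario 7's four-table join is straightforward once the codes are known. A sketch with SQLite in place of H2: the `'DRV'` category code and the sample rows are assumptions for illustration (the post never shows the real derivatives encoding), while `C_STS_CD = 1` and `T_STS = 'A'` (Approved) come from the trap tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CLIENTS        (C_ID INTEGER, C_NAME TEXT, C_STS_CD INTEGER);
CREATE TABLE PRODUCTS       (P_ID INTEGER, PRD_CAT TEXT);  -- 'DRV' assumed for derivatives
CREATE TABLE CLIENT_PRODUCT (CP_CLT_ID INTEGER, CP_PRD_ID INTEGER);
CREATE TABLE TXN_LOG        (T_ID INTEGER, C_REF INTEGER, T_STS TEXT);
""")
conn.executemany("INSERT INTO CLIENTS VALUES (?,?,?)", [(1, "Acme Corp", 1), (2, "Old Co", 9)])
conn.execute("INSERT INTO PRODUCTS VALUES (7, 'DRV')")
conn.execute("INSERT INTO CLIENT_PRODUCT VALUES (1, 7)")
conn.execute("INSERT INTO TXN_LOG VALUES (100, 1, 'A')")

# Same join shape the model produced, but with the real codes: 1=active, 'A'=Approved.
rows = conn.execute("""
    SELECT DISTINCT c.C_NAME
    FROM CLIENTS c
    JOIN CLIENT_PRODUCT cp ON c.C_ID = cp.CP_CLT_ID
    JOIN PRODUCTS p        ON cp.CP_PRD_ID = p.P_ID
    JOIN TXN_LOG t         ON c.C_ID = t.C_REF
    WHERE c.C_STS_CD = 1 AND p.PRD_CAT = 'DRV' AND t.T_STS = 'A'
""").fetchall()
print(rows)  # [('Acme Corp',)]
```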
## 📊 Failure Analysis by Type
| Failure Mode | Count | Scenarios | Root Cause |
|---|---|---|---|
| Opaque integer codes treated as strings | 4 | 1, 3, 6, 7 | LLM defaults to English labels for numeric codes |
| Wrong table/column selected | 2 | 4, 8 | Non-obvious column names; wrong join assumptions |
| Full English word for single-char code | 2 | 5, 9 | `'CORPORATE'` ≠ `'C'`; `'UNDER REVIEW'` ≠ `'R'` |
| Cross-domain period format mismatch | 1 | 10 | `'Q1 2024'` ≠ `'2024Q1'`; joined wrong tables |
| Lucky inference | 1 | 2 | `C_REF` close enough to `C_ID` for correct join |
## 🔑 Key Takeaway
| Approach | Accuracy on Simple Schemas | Accuracy on Complex/Cryptic Schemas |
|---|---|---|
| Vanilla NLP-to-SQL | 60–80% | 10% (this demo, live result) |
| With schema annotations | 70–85% | 40–55% |
| With KG / ontology layer (HyperRAG-style) | 85–95% | 70–85% |
The fix is not a better LLM. With 11 tables, 6 business domains, and systematically opaque naming, Nvidia NEMO achieved 10% accuracy, which is not meaningfully better than a random guess.

The fix is a semantic layer: a knowledge graph or ontology that maps cryptic column names to business concepts before the LLM ever sees them.
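A minimal sketch of what such a layer changes at prompt-construction time, assuming a hand-written annotation dict. In a real system the mappings would come from a knowledge graph or ontology, not a dict; the 0=prospect / 9=churned glosses are illustrative assumptions, while the rest restates the trap tables above.

```python
# Hypothetical annotation map: cryptic physical names -> business semantics.
SEMANTICS = {
    "CLIENTS.C_STS_CD":  "client status, INTEGER: 1=active (0=prospect, 9=churned assumed)",
    "CLIENTS.C_SEG":     "client segment, CHAR: 'C'=Corporate",
    "TXN_LOG.C_REF":     "undeclared foreign key to CLIENTS.C_ID",
    "TXN_LOG.T_STS":     "transaction status, CHAR: A/P/R/C ('R'=Rejected)",
    "KYC_REC.KYC_STS":   "KYC status, CHAR: 'R'=Review (NOT Rejected, unlike TXN_LOG.T_STS)",
    "FIN_METRICS.M_PRD": "reporting period as 'YYYYMM' VARCHAR, e.g. '202401'",
    "FIN_METRICS.VAL_A": "gross revenue",
}

def annotate_prompt(schema_ddl: str, question: str) -> str:
    """Prepend column-level business meanings so the LLM no longer has to guess."""
    notes = "\n".join(f"-- {col}: {meaning}" for col, meaning in SEMANTICS.items())
    return f"{notes}\n\n{schema_ddl}\n\n-- Question: {question}"

prompt = annotate_prompt(
    "CREATE TABLE CLIENTS (C_ID INT, C_STS_CD INT, C_SEG CHAR(1));",
    "How many active clients do we have?")
print("1=active" in prompt)  # True: the model now sees the code's meaning
```

With this context in the prompt, scenario 1's `C_STS_CD = 'A'` guess has no reason to occur; the annotation states the integer encoding outright.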