An error is the safe outcome. The dangerous one is a number.
When a text-to-SQL feature breaks the way you hope, it throws a missing column and dies on the screen. You see it, you fix it, nobody was misled.
The failure that should keep you up is the other one. The query runs, returns rows, and hands a real user a number that is wrong, at the moment the stakes are highest. Nothing errored. Just a plausible figure, formatted to two decimals, in front of the person least able to know it's off.
That gap, between an answer that is wrong and one that looks wrong, is the whole problem with pointing an LLM at a production database. It never shows up in the demo. The demo is built to avoid it.
What follows is not a hypothetical. It is a public dataset and a question with a known answer that an LLM gets confidently, catastrophically wrong. The data is the Olist Brazilian e-commerce dataset on Kaggle. You can reproduce every number here yourself.
"Nobody ever comes back"
What is our repeat-purchase rate? How many customers come back and buy again. That number separates a business that compounds from one that re-buys its whole revenue base every quarter.
You wire an "ask your data" box into your product. Someone types the question in plain English, and an LLM turns it into SQL and runs it. Two tables are in scope:
olist_customers_dataset olist_orders_dataset
customer_id order_id (pk)
customer_unique_id customer_id
customer_zip_code_prefix order_status
customer_city order_purchase_timestamp
customer_state ...
Here is the SQL the LLM wrote. No human reviewed it; it ran straight against the database:
SELECT
CASE WHEN order_count > 1 THEN 'repeat' ELSE 'one-time' END AS segment,
COUNT(*) AS customers
FROM (
SELECT c.customer_id, COUNT(DISTINCT o.order_id) AS order_count
FROM olist_orders_dataset o
JOIN olist_customers_dataset c ON c.customer_id = o.customer_id
GROUP BY c.customer_id
) per_customer
GROUP BY segment;
The answer: one-time 99,441. Repeat 0. Repeat rate 0.00%.
Nobody ever comes back. Not one customer out of ninety-nine thousand has placed a second order.
That query, and that 0.00%, are the LLM's own output. Read it like a code reviewer: the join is valid, the aggregation is clean, COUNT(DISTINCT o.order_id) is exactly how you count a customer's orders. By the standard of "is this reasonable SQL," it is reasonable SQL.
It is also wrong, in the most decision-changing direction a retention number can be. And it's wrong for a reason the schema will never tell you.
The column that lies about what it is
Two real values from the customers table:
customer_id 06b8999e2fba1a1fbc88172c00ba8bc7
customer_unique_id 861eff4711a542e4b93843c6dd7febb0
Two 32-character hex strings, identical in form. Nothing signals which is the person and which is the receipt. If you had to guess which means "customer," you'd flip a coin. So does the LLM.
In Olist, customer_id is minted fresh for every order. It is an order key wearing the word "customer" in its name. The actual person, stable across orders, is customer_unique_id.
The customers table has exactly one row per customer_id: 99,441 rows, 99,441 ids. So when you group orders by customer_id, every group has exactly one order, by construction. Max orders per customer_id in the whole dataset: 1.
That is why the rate came back 0.00%. The query computed the repeat rate of orders pretending to be people, which is always exactly zero.
Change one token, group by the column that identifies a person, and the data flips:
SELECT c.customer_unique_id, COUNT(DISTINCT o.order_id) AS order_count
...
GROUP BY c.customer_unique_id -- the only change: _unique_id
The truth: 96,096 real customers, 2,997 repeat buyers, a 3.12% repeat rate. One customer ordered 17 times.
Not a great retention number. But the distance between "3.12% come back" and "nobody comes back" is the distance between a business and a fire.
The simple count breaks too. COUNT(DISTINCT customer_id) says 99,441 customers; COUNT(DISTINCT customer_unique_id) says 96,096. The naive version overstates the base by 3,345 people, about 3.5%.
The count is wrong by a little. The retention answer is wrong by everything.
A bigger model is not the fix
The reflex is to reach for a better LLM. So I ran a small test. Not a benchmark, a sanity check.
Two open-source models, Llama 3.3 70B and Qwen 2.5 Coder 32B, both solid SQL writers, each asked the repeat-rate question several times, with only the raw schema to go on.
Both did the obvious thing: grouped by customer_id, reported 0.00% most of the time. The bigger model was not meaningfully safer than the smaller one.
A frontier model with the same schema and the same blind spot makes the same mistake with more fluency. The variable is not the model. It is what the model is allowed to know, and what it is allowed to do.
Why the demo never caught it
A demo tests questions a human already validated by eye. "Orders last month," fine. "Revenue by state," fine. Each one happened to dodge the customer_id versus customer_unique_id distinction.
The retention question was the first to force the choice. By then the only reader was someone who typed a sentence and could not see the SQL.
The LLM makes a modeling decision, which column is a customer, fresh on every query, with only the schema to go on. A column named customer_id does not announce that it is secretly an order key. That fact lives in someone's head, or nowhere.
The tooling makes it concrete: the interface here is an MCP server in raw-SQL mode, exposing a read-only execute_sql tool. No layer sits between the model's guess and the database. It runs the bad query and returns 0.00%, because running well-formed read-only SQL is its whole job.
Use the simplest abstraction the workload tolerates
None of this makes text-to-SQL a mistake. It makes it a choice with a domain.
Poking around Olist yourself on a Tuesday? Raw SQL over MCP is great. When the rate comes back 0.00%, you'll notice, because 0% is absurd, and you'll go find the _unique_id column. The blast radius is one person who can distrust the number.
Internal versus external is the obvious line, and it's the wrong one. The example above was internal, someone querying their own store, and the number could still walk into an investor update.
The real axis is who consumes the answer and whether they can sanity-check it. A human exploring squints at 0.00% and smells something wrong. Software does not. A dashboard tile, or an agent that reads the number and acts on it, has no instinct. "Looks plausible" is not a correctness check.
And 0.00% is the cruelest case: on a retention dashboard it doesn't look implausible. It looks like bad news. Nobody files a bug to report bad news.
So, the principle worth keeping even if you never touch Gaur: a confident wrong answer is what you get when you bring an exploration-grade abstraction to a consumption-grade workload. The mismatch is the bug, not the query engine.
Two layers: one helps, one enforces
This is the problem we work on at Gaur. Gaur (gaur.run) is a developer-first backend for data applications: connect your scattered sources, model them, and expose them to your app over REST, an OpenAI-compatible chat endpoint, and MCP, without standing up a warehouse and a data team first.
It gives you two ways to push back on the wrong answer above. They are not equally strong.
The cheap layer: tell the model what the schema can't
The schema failed to carry meaning, so carry it. In Gaur, every table, model, and contract holds author-written context that gets embedded and surfaced to chat and MCP. For the customers table, it says the one thing the column names hide:
## Field descriptions
- customer_id — A per-ORDER id, minted fresh for every order. NOT a person.
- customer_unique_id — The actual person. Stable across orders.
## Rules
- For any retention / repeat / per-customer metric, aggregate by
customer_unique_id. Grouping by customer_id makes every customer look
new, so the repeat rate computes as 0%. The true rate is 3.12%.
I re-ran the same test with that context in front of both models. The improvement was real: both reached for customer_unique_id much more often, and returned 3.12% where they had returned 0.00%.
But be precise, because "context fixes it" is the overclaim this essay argues against. Context helps. It does not enforce. The model still chooses; you've just made the right choice the obvious one. On the runs where it still picked customer_id, the answer was still 0.00%, still confident, still wrong.
The enforcing layer: take the choice away
The only way to make 0.00% unreachable is to stop exposing the column that produces it. That is what a contract does.
First, a model collapses orders to one row per real person:
-- Model: olist_customer_orders. Grain: one row per customer_unique_id.
SELECT c.customer_unique_id, count(DISTINCT o.order_id) AS order_count
FROM olist_orders_dataset o
JOIN olist_customers_dataset c ON c.customer_id = o.customer_id
GROUP BY c.customer_unique_id
Then a contract over that model exposes only declared measures and dimensions, all defined on customer_unique_id:
{
"name": "customer_retention",
"sources": [{ "source_type": "model", "name": "olist_customer_orders", "alias": "co" }],
"measures": {
"customers": { "sql": "count(distinct co.customer_unique_id)", "type": "number" },
"repeat_customers": { "sql": "sum(case when co.order_count > 1 then 1 else 0 end)", "type": "number" },
"repeat_rate": { "behavior": "derived", "numerator": "repeat_customers", "denominator": "customers" }
}
}
The point: customer_id is not in this contract at all.
A consumer calling it, over REST, chat, or MCP, picks from declared measures and dimensions. It never names a column, never writes a GROUP BY, never sees that customer_id exists. It cannot reproduce 0.00%, because the building block is not on the shelf. The contract returns customers = 96,096, repeat_customers = 2,997, repeat_rate = 0.0312.
So I ran the test a third time. I published this contract to a contract-only consumer and pointed the same two models at it. Both returned 3.12% every time. There was no wrong column left to reach for.
One model still hallucinated, but it named a field the contract didn't define, and the contract rejected it. That is the behavior you want: the failure that errors, not the failure that lies.
Two honest caveats. The SQL-only demo from the earlier story doesn't expose contracts at all; that consumer's job is raw SQL, which is exactly why the model could reach the wrong column. Whether a caller can write SQL is a property of the consumer's query mode. Same platform, two postures.
And I won't claim contracts make wrong answers impossible. A measure that aggregates across the wrong join can still publish; you catch it by testing before you ship. What a contract changes is where the decision lives: the hard call, which column is a person, gets made once by someone who can reason about the schema, instead of re-guessed on every question by a model holding two indistinguishable hex strings.
Check it yourself
Don't take my word for it. Olist is a free download. Run the count both ways:
SELECT COUNT(DISTINCT customer_id) FROM olist_customers_dataset; -- 99,441
SELECT COUNT(DISTINCT customer_unique_id) FROM olist_customers_dataset; -- 96,096
Then compute the repeat rate both ways and watch the second number contradict the first. One says nobody comes back. The other says 2,997 people did, one of them seventeen times. Nothing errors. Nothing warns you. The database hands you both answers with the same confidence, and only one is true.
The part we build (shameless plug)
That 0.00% is not special to retention or to Olist. It is the default failure for any LLM pointed at real business data: a support copilot answering questions about an account, an agent flow that reads a metric and then acts on it, customer-facing analytics, a reporting API. The query runs, the number is wrong, and software downstream believes it.
Contracts are how you put an LLM in front of that data without handing it a blank SQL prompt. The model composes answers from a fixed set of measures and dimensions instead of guessing columns. Row-level security scopes every caller the same way across REST, chat, and MCP. An undeclared field errors instead of lying. It doesn't make every answer right; it helps make the confident-wrong one structurally hard to ship.
If you're building a chatbot or an agent flow over your own business data and you don't want it shipping numbers you can't trust, that is exactly what we're building Gaur for. Book a demo at gaur.run.
Top comments (0)