Sola Samuel

Posted on Jun 14

The --schema-only flag that makes enterprise customers comfortable with AI

#ai #llm #typescript #database

Every enterprise conversation about AI hits the same wall, usually within the first 30 minutes:

"This looks great. But we can't give you access to our production data."

And they're right to say it. Their data is regulated, customer-owned, or both. So the demo dies before it starts — not because the AI can't do the job, but because nobody can safely show that it can.

I ran into this enough times that I built a CLI around the answer to that objection. It's called claude-query — you point it at a data source (PostgreSQL, CSV, JSON, Notion, Airtable), ask a question in plain English, and Claude constructs and runs the query. Standard text-to-SQL, more or less.

The interesting part isn't the happy path. It's a single flag: --schema-only.

The core idea: the model needs the shape of your data, not the data

To turn "which customers signed up last month but never bought anything?" into SQL, a model needs to know:

what tables exist (customers, orders, …)
what columns they have and their types (created_at timestamp, customer_id int, …)
how they relate (the foreign key from orders.customer_id → customers.id)

Notice what's not on that list: any actual customer, any actual order, any actual row. The model reasons over the structure. The data itself is irrelevant to writing the query.

That's the whole insight. Query construction is a schema problem, not a data problem.
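To make that list concrete, here is a minimal sketch of a structure-only schema description. The TableInfo and ColumnInfo types and the describeSchema helper are hypothetical names chosen for illustration; claude-query's real types may look different.

```typescript
// Hypothetical types: the real schema types in claude-query may differ.
interface ColumnInfo {
  name: string;
  type: string;
  // Foreign-key target, e.g. { table: "customers", column: "id" }.
  references?: { table: string; column: string };
}

interface TableInfo {
  name: string;
  columns: ColumnInfo[];
}

// Render structure only: table names, columns, types, and relationships.
function describeSchema(tables: TableInfo[]): string {
  return tables
    .map((table) => {
      const columns = table.columns.map((col) => {
        const fk = col.references
          ? ` -> ${col.references.table}.${col.references.column}`
          : "";
        return `  ${col.name} ${col.type}${fk}`;
      });
      return [`table ${table.name}`, ...columns].join("\n");
    })
    .join("\n\n");
}

const shopSchema: TableInfo[] = [
  {
    name: "customers",
    columns: [
      { name: "id", type: "int" },
      { name: "created_at", type: "timestamp" },
    ],
  },
  {
    name: "orders",
    columns: [
      { name: "id", type: "int" },
      {
        name: "customer_id",
        type: "int",
        references: { table: "customers", column: "id" },
      },
    ],
  },
];

console.log(describeSchema(shopSchema));
```

Everything in that output is something a DBA would put on a whiteboard. None of it identifies a customer.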

What "schema-first" looks like in normal mode

By default, my Postgres adapter extracts a rich schema to give the model good context — including up to 3 sample distinct values per column, so it can recognise enum-like fields (a status column that's only ever active/churned/trial). That sampling uses TABLESAMPLE on large tables so it stays cheap.
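As a rough sketch of that sampling step, the function below builds the kind of query that could fetch a few distinct values for one column. The row threshold, the 1% sampling rate, and the function names are illustrative assumptions, not the adapter's actual code.

```typescript
// Assumption: tables with more estimated rows than this are sampled, not scanned.
const LARGE_TABLE_ROWS = 100000;

// Quote an identifier for PostgreSQL, doubling any embedded double quotes.
function quoteIdent(name: string): string {
  return `"${name.replace(/"/g, '""')}"`;
}

// Build a query returning up to `limit` distinct non-null values of a column.
function buildSampleQuery(
  table: string,
  column: string,
  estimatedRows: number,
  limit = 3,
): string {
  const col = quoteIdent(column);
  // TABLESAMPLE SYSTEM (1) samples roughly 1% of the table at the block level.
  const sample =
    estimatedRows > LARGE_TABLE_ROWS ? " TABLESAMPLE SYSTEM (1)" : "";
  return (
    `SELECT DISTINCT ${col} FROM ${quoteIdent(table)}${sample} ` +
    `WHERE ${col} IS NOT NULL LIMIT ${limit}`
  );
}

console.log(buildSampleQuery("customers", "status", 5000000));
```

Block-level sampling is fast but not uniformly random, which is acceptable here: the goal is a hint about what a column looks like, not a statistic.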

Those sample values are genuinely useful... and they are also real data. A sample of the email column is three real customer emails. That's exactly the thing a regulated customer can't let leave their environment.

So the flag has to do two things.

What `--schema-only` actually changes

1. It strips every row-derived value from the schema before anything is sent.

The redaction is deliberately boring — boring is good for a security boundary:

export function redactSchemaForGovernance(schema: SchemaContext): SchemaContext {
  return {
    ...schema,
    tables: schema.tables.map((table) => ({
      ...table,
      columns: table.columns.map((col) => {
        // Drop row-derived fields (sampleValues, min, max); keep name, type, references.
        const { sampleValues, min, max, ...structural } = col;
        return structural;
      }),
    })),
  };
}

Table names, column names, types, foreign keys, and row counts survive (a count is aggregate metadata, not a record). Sample values and min/max ranges, along with anything else derived from an individual row, are gone.
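Here is a before-and-after sketch of that redaction on a one-table schema. The interface definitions are a guess at a minimal SchemaContext, included only so the example runs on its own; the real type likely carries more fields.

```typescript
// Hypothetical minimal types, assumed for this example only.
interface ColumnContext {
  name: string;
  type: string;
  references?: string;
  sampleValues?: string[];
  min?: string;
  max?: string;
}

interface TableContext {
  name: string;
  rowCount: number;
  columns: ColumnContext[];
}

interface SchemaContext {
  tables: TableContext[];
}

function redactSchemaForGovernance(schema: SchemaContext): SchemaContext {
  return {
    ...schema,
    tables: schema.tables.map((table) => ({
      ...table,
      columns: table.columns.map((col) => {
        // Row-derived fields are pulled out and discarded.
        const { sampleValues, min, max, ...structural } = col;
        return structural;
      }),
    })),
  };
}

const before: SchemaContext = {
  tables: [
    {
      name: "customers",
      rowCount: 1200,
      columns: [
        {
          name: "email",
          type: "text",
          sampleValues: ["ada@example.com", "bob@example.com"],
          min: "ada@example.com",
          max: "zoe@example.com",
        },
      ],
    },
  ],
};

const after = redactSchemaForGovernance(before);
console.log(JSON.stringify(after, null, 2));
```

The printed schema keeps the table name, the row count, and the column's name and type. The input object is left untouched, so normal mode can still use the unredacted version.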

2. It swaps the tool so the model can't execute anything.

claude-query uses Claude's tool-use API rather than asking for SQL as text. In normal mode the model is given an execute_query tool. In --schema-only mode it's given a describe_query tool instead: same shape, but the tool describes the query it would run without ever executing it. There's no code path from the model's tool call to query execution against your database.

So you can point it at a production database whose contents you aren't allowed to see:

claude-query --source postgres://prod/customers --schema-only \
  "Which customers churned last quarter — bought before but not since?"

…and get back the exact SQL the model would run, with zero rows read and zero rows transmitted.
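The tool swap described above might look something like the sketch below. The tool names come from this post, and the name/description/input_schema layout follows Claude's tool-use API, but the single sql input field, the helper names, and the handler are assumptions made for illustration.

```typescript
// Shape of a tool definition as passed to Claude's tool-use API.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

// Assumption: both tools take a single `sql` string, so they share a shape.
function makeTool(name: string, description: string): ToolDefinition {
  return {
    name,
    description,
    input_schema: {
      type: "object",
      properties: { sql: { type: "string", description: "The SQL query" } },
      required: ["sql"],
    },
  };
}

// The mode decides which single tool the model is offered.
function selectTools(schemaOnly: boolean): ToolDefinition[] {
  return schemaOnly
    ? [makeTool("describe_query", "Report the SQL that would answer the question.")]
    : [makeTool("execute_query", "Run SQL against the data source.")];
}

type RunSql = (sql: string) => Promise<unknown[]>;

// describe_query echoes the SQL back and never touches runSql.
async function handleToolCall(
  name: string,
  input: { sql: string },
  runSql: RunSql,
): Promise<{ sql: string; rows?: unknown[] }> {
  if (name === "describe_query") return { sql: input.sql };
  if (name === "execute_query") {
    return { sql: input.sql, rows: await runSql(input.sql) };
  }
  throw new Error(`Unknown tool: ${name}`);
}
```

Because the model is only ever offered one of the two tools, a schema-only session has no way to request execution, and the describe branch has no database call to reach even if it did.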

"Trust me" isn't a security guarantee — so I tested the boundary

Here's the part I'm most happy with. It's easy to claim no data leaks. It's better to assert it.

The test seeds the schema with a sentinel value — a string that could only appear if redaction failed — and then checks that it appears nowhere in the outbound payload to the model:

const SENTINEL = "SENTINEL_ROW_VALUE_42";
// ...schema seeded with sampleValues: [SENTINEL, ...], min: SENTINEL

const params = create.mock.calls[0][0]; // the actual request sent to Claude
expect(JSON.stringify(params)).not.toContain(SENTINEL);

If a future refactor ever reintroduces a leak — a new schema field that carries data, a serialisation that forgets to redact — this test goes red. The governance guarantee is regression-protected, not aspirational.
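For anyone who wants to try the pattern without a test framework, here is a self-contained sketch of the same idea. The RecordingClient class, the askSchemaOnly function, and the request layout are hypothetical stand-ins rather than claude-query's code; the real test inspects a mocked client call instead.

```typescript
interface Column {
  name: string;
  type: string;
  sampleValues?: string[];
  min?: string;
  max?: string;
}

interface Schema {
  tables: { name: string; columns: Column[] }[];
}

// Stand-in for the redaction step: strips row-derived fields.
function redact(schema: Schema): Schema {
  return {
    tables: schema.tables.map((table) => ({
      ...table,
      columns: table.columns.map(
        ({ sampleValues, min, max, ...structural }) => structural,
      ),
    })),
  };
}

// Hypothetical stand-in for the model client: records requests, sends nothing.
class RecordingClient {
  readonly requests: unknown[] = [];
  create(params: unknown): void {
    this.requests.push(params);
  }
}

// Hypothetical request builder for schema-only mode.
function askSchemaOnly(
  client: RecordingClient,
  schema: Schema,
  question: string,
): void {
  client.create({
    system: `Database schema:\n${JSON.stringify(redact(schema))}`,
    messages: [{ role: "user", content: question }],
  });
}

const SENTINEL = "SENTINEL_ROW_VALUE_42";
const seeded: Schema = {
  tables: [
    {
      name: "customers",
      columns: [
        { name: "email", type: "text", sampleValues: [SENTINEL, "x"], min: SENTINEL },
      ],
    },
  ],
};

const client = new RecordingClient();
askSchemaOnly(client, seeded, "Top 10 customers by lifetime spend");

// The whole outbound request is searched, not just the schema portion.
const outbound = JSON.stringify(client.requests[0]);
if (outbound.includes(SENTINEL)) {
  throw new Error("Row-derived value leaked into the outbound request");
}
```

Searching the serialised request as a whole is the point: the check does not need to know which field a leak would travel through.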

I'll be honest about one wrinkle I hit building the audit helper: my first version flagged the digit 1 as a "leak" because it appeared inside a row count like 1200. Short values collide with structural text. The fix was to treat only distinctive values (length ≥ 4) as leak signals: real PII like emails, most names, and dates clears that bar; an incidental 1 doesn't. Worth knowing if you build something similar.
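A minimal sketch of an audit helper with that length threshold, under the assumption that it works by plain substring matching. The findLeakedValues name and signature are hypothetical.

```typescript
// Values shorter than this collide too easily with structural text.
const MIN_DISTINCTIVE_LENGTH = 4;

// Return the row values that appear verbatim in the outbound payload.
function findLeakedValues(
  payload: string,
  rowValues: readonly unknown[],
  minLength: number = MIN_DISTINCTIVE_LENGTH,
): string[] {
  const leaked = new Set<string>();
  for (const value of rowValues) {
    if (value === null || value === undefined) continue;
    const text = String(value);
    // Skip short values: "1" would match inside a row count like 1200.
    if (text.length < minLength) continue;
    if (payload.includes(text)) leaked.add(text);
  }
  return Array.from(leaked);
}

const cleanPayload = JSON.stringify({
  table: "customers",
  rowCount: 1200,
  columns: ["id", "email"],
});
const rowValues = [1, "ada@example.com", "2024-03-01"];

console.log(findLeakedValues(cleanPayload, rowValues));
console.log(findLeakedValues(cleanPayload + " ada@example.com", rowValues));
```

The threshold is a trade-off: it removes false positives from short values, but a genuinely leaked two- or three-character value would also go unnoticed, so it complements the redaction step rather than replacing it.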

Why this is the part that actually matters commercially

You can frame the same flag two completely different ways depending on who's listening:

To a developer: "it's a dry-run that also redacts the schema."
To a CISO or a data-governance lead: "the model receives the structure of your database and never a single row of customer data — and here's the test that proves it."

The second framing is the one that unblocks the deal. The technical objection ("can AI understand our data?") was never the real blocker. The governance objection was. --schema-only lets you demonstrate capability and respect the data boundary in the same breath.

That's the lesson I keep relearning: in enterprise AI, the constraint is rarely the model. It's the data boundary around the model. Build for the boundary first and the demo gets to happen at all.

Try it

npm install -g claude-query
export ANTHROPIC_API_KEY=sk-ant-...

claude-query --source postgres://localhost/shop --schema-only \
  "Top 10 customers by lifetime spend"

Code and the full per-adapter setup guide: github.com/solasamuel/claude-query-cli

Built with TypeScript, Commander.js, and the Claude API's tool-use capability. Next on the list: MongoDB support — the whole thing is built around a single DataSourceAdapter contract, so a document store is a new adapter, not a rewrite.

If you've shipped AI into a regulated environment, I'd genuinely like to hear how you handled the data-access conversation — that's the part nobody writes about.
