Eitamos Ring

A Protobuf for Database Schemas

Every serious system has an interface definition for its wire format. gRPC has protobuf. REST has OpenAPI. GraphQL has its SDL. But databases -- the thing everything else is built on top of -- have nothing.

Your database schema is one of the most important artifacts in your system. It defines every table, column, type, constraint, relationship, and index. It encodes years of domain decisions. And yet there is no standard, portable, machine-readable format for it.

We built one. We call it ctxexport.json.

The problem is older than LLMs

Before you assume this is an AI-context story, consider how many times you have needed your schema outside the database itself:

  • Onboarding a new engineer who needs to understand the data model.
  • Diffing staging against production to catch drift before a deploy.
  • Running a linter in CI to enforce naming conventions or catch missing indexes.
  • Generating documentation that is not immediately stale.

Every time, you end up writing a bespoke script that queries information_schema or pg_catalog, parses the output, and feeds it into whatever tool you need. The script is Postgres-specific. It breaks when you add a second schema. Nobody maintains it.

pg_dump --schema-only exists, but it is a restore format, not a consumption format. It is Postgres-specific SQL with SET statements, ownership clauses, and an ordering designed for replay, not reading. Try parsing it reliably. Try feeding it to a linter. Try diffing two of them without drowning in noise.

MongoDB is worse. There is no mongodump --schema-only. Your schema lives in the shape of whatever documents happen to exist. Good luck extracting that into something a tool can reason about.

Extract once, use many ways

The core insight behind ctxexport.json is the same one behind protobuf: separate the definition from the consumption.

A protobuf .proto file is written once and compiled to Go structs, Python classes, TypeScript types, gRPC stubs, or REST gateways. The definition is the single source of truth. The consumers are many and varied.

ctxexport.json works the same way. You extract your schema once -- from Postgres, MongoDB, or whatever backend -- and produce a single canonical JSON file. That file contains entities (tables, views, collections), fields (columns with types, nullability, defaults), edges (foreign keys and inferred references), and access paths (indexes). Everything a tool needs to understand your data model, nothing it does not.
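For illustration, an edge record under this model might look like the following. The field names here are hypothetical -- the authoritative shape is documented in the project's contract file:

```json
{
  "from": "orders.user_id",
  "to": "users.id",
  "kind": "foreign_key"
}
```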

From that single artifact, you can:

  • Compile to a lighthouse map -- a compact table-and-relationship summary that fits in an LLM prompt.
  • Compile to full SQL DDL -- standard CREATE TABLE statements for any subset of tables.
  • Serve over MCP -- give an AI agent schema awareness without database credentials.
  • Diff across environments -- compare staging and production schemas as structured data, not text.
  • Lint offline -- check naming conventions, missing indexes, or orphaned foreign keys in CI.
  • Validate in CI -- catch schema regressions before they reach production.
  • Commit to git -- your schema becomes a versioned artifact with a real history.

None of these consumers need to know whether the source was Postgres or MongoDB. None of them need a live database connection. The extraction happened once, upstream, and everything downstream reads the same contract.
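As a sketch of the diffing use case: because the schema is plain JSON, an environment comparison is a few lines of dictionary arithmetic. This assumes the entity shape described above (a top-level `entities` array of named objects with `fields`); it is not the project's own diff tool.

```python
def field_map(entity):
    """Index an entity's fields by name for structural comparison."""
    return {f["name"]: f for f in entity.get("fields", [])}

def diff_schemas(a, b):
    """Compare two ctxexport.json documents (parsed dicts, e.g. via
    json.load) and report added, removed, and changed entities."""
    ents_a = {e["name"]: e for e in a["entities"]}
    ents_b = {e["name"]: e for e in b["entities"]}
    report = {
        "added": sorted(ents_b.keys() - ents_a.keys()),
        "removed": sorted(ents_a.keys() - ents_b.keys()),
        "changed": [],
    }
    for name in sorted(ents_a.keys() & ents_b.keys()):
        if field_map(ents_a[name]) != field_map(ents_b[name]):
            report["changed"].append(name)
    return report

production = {"entities": [
    {"name": "users", "fields": [{"name": "id", "type": "bigint", "is_pk": True}]},
    {"name": "audit_log", "fields": [{"name": "id", "type": "uuid", "is_pk": True}]},
]}
staging = {"entities": [
    {"name": "users", "fields": [{"name": "id", "type": "uuid", "is_pk": True}]},
]}

print(diff_schemas(production, staging))
# {'added': [], 'removed': ['audit_log'], 'changed': ['users']}
```

Because the comparison works on structured data, a type change (`bigint` vs `uuid`) surfaces as a changed entity instead of a noisy textual diff of DDL.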

The sidecar pattern

Databases have never been good at carrying human knowledge alongside the schema. Your users.deleted_at column is a soft-delete flag, but the database only knows it is a timestamp with time zone. Your orders.payload column is JSONB with a specific structure, but the database sees an opaque blob.

A sidecar file (dbdense.yaml) layers descriptions and value annotations onto the extracted schema:

entities:
  payments:
    fields:
      status:
        values: ["pending", "authorized", "paid", "failed", "refunded"]
  users:
    fields:
      deleted_at:
        description: "Soft delete timestamp. NULL = active."

This merges at export time. The compiled DDL gets inline comments like -- Values: pending, authorized, paid, failed, refunded. Every downstream consumer -- linter, LLM, documentation generator -- picks it up automatically. Write it once in a YAML file committed to the repo.
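A minimal sketch of what that merge step could look like, assuming the entity shape shown later in this post. The `sidecar` dict below stands in for the parsed `dbdense.yaml` (e.g. via PyYAML's `safe_load`); the real merge happens inside the dbdense exporter.

```python
def merge_sidecar(schema, sidecar):
    """Layer sidecar annotations (descriptions, enumerated values)
    onto an extracted schema in place, matching by entity and field
    name. Unmatched annotations are silently ignored in this sketch."""
    notes = sidecar.get("entities", {})
    for entity in schema["entities"]:
        field_notes = notes.get(entity["name"], {}).get("fields", {})
        for field in entity.get("fields", []):
            extra = field_notes.get(field["name"])
            if extra:
                field.update(extra)  # adds "description", "values", ...
    return schema

schema = {"entities": [
    {"name": "payments", "fields": [{"name": "status", "type": "text"}]},
]}
# Stand-in for the parsed dbdense.yaml shown above.
sidecar = {"entities": {"payments": {"fields": {
    "status": {"values": ["pending", "authorized", "paid", "failed", "refunded"]},
}}}}

merged = merge_sidecar(schema, sidecar)
print(merged["entities"][0]["fields"][0]["values"])
# ['pending', 'authorized', 'paid', 'failed', 'refunded']
```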

Why JSON, not SQL

SQL DDL is human-readable but machine-hostile. Parsing CREATE TABLE statements reliably across dialects is a nightmare. Defaults are quoted differently. Constraints can be inline or out-of-band. Comments use different syntax. There is no standard way to represent a foreign key relationship as structured data.

JSON is boring and that is the point. It is a declarative state representation -- you look up a table by name, not by parsing DDL statement order. Every language has a JSON parser. The schema is simple: a version string, an array of entities, and an array of edges. You can validate it with a JSON Schema. You can diff it with jq. You can read it in any language without a SQL parser.

A minimal entity looks like this:

{
  "name": "payments",
  "type": "table",
  "fields": [
    {"name": "id", "type": "uuid", "is_pk": true},
    {"name": "status", "type": "text", "not_null": true, "values": ["pending", "paid", "failed"]}
  ]
}

Flat, predictable, zero ambiguity.
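That flatness is what makes offline linting cheap. A sketch of a CI lint pass over one entity -- the rules here (snake_case names, primary key required) are illustrative, not the project's built-in lint set:

```python
import re

def lint_entity(entity):
    """Run offline lint checks over one ctxexport.json entity and
    return a list of human-readable problems."""
    problems = []
    if not any(f.get("is_pk") for f in entity["fields"]):
        problems.append(f"{entity['name']}: no primary key")
    for f in entity["fields"]:
        if not re.fullmatch(r"[a-z][a-z0-9_]*", f["name"]):
            problems.append(f"{entity['name']}.{f['name']}: not snake_case")
    return problems

payments = {
    "name": "payments",
    "type": "table",
    "fields": [
        {"name": "id", "type": "uuid", "is_pk": True},
        {"name": "CreatedAt", "type": "timestamptz"},
    ],
}
print(lint_entity(payments))
# ['payments.CreatedAt: not snake_case']
```

No database connection, no SQL parser -- the linter reads the same contract as every other consumer.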

Stop treating your schema like a black box

The immediate use case is LLM context -- giving AI agents schema awareness without live database access. But the format is deliberately general. If your tool can read JSON, it can read a database schema. That was not true before.

The project is at github.com/valkdb/dbdense. The contract is documented in docs/ctxexport-contract.md. It supports Postgres and MongoDB today. The extractor interface is small enough that adding a new backend is a single file.

Your database schema is too important to be locked inside the database. Export it. Version it. Build on it.
