Your Test Data Is Type-Correct and Still Invalid: 6 Postgres Schema Features Generators Skip

Mikhail Shytsko — Mon, 01 Jun 2026 19:25:26 +0000


TL;DR: Composite primary keys, partial unique indexes, cross-column CHECK constraints, JSONB shape, GENERATED ALWAYS columns, and row-level security all reject type-correct data, because column types are not what your schema actually enforces.

A few months ago I watched a seed run finish with a clean green summary: every column populated, every type correct, a few thousand rows inserted. The first integration test then failed on an INSERT the application itself ran. The generated data was valid the way a sentence with correct grammar can still be a lie. Each value matched its column type. The combination of values broke a constraint the generator never looked at.

That gap has a simple cause. A column type is a per-column promise: this is an integer, this is text, this is jsonb. Most of what a real schema enforces is not per-column. It lives one level up: across columns in a row, across rows in a table, or across the role doing the writing. A generator that thinks in columns produces data that is type-correct and still invalid.

Here are six places that gap shows up in Postgres, what each one actually enforces, and a query you can run to see whether your own generated data respects it.

1. Composite primary keys: the tuple is unique, not the columns

A composite primary key enforces uniqueness over the combination of columns, not over each column on its own. The docs put it plainly: "the combination of values in the indicated columns is unique across the whole table, though any one of the columns need not be (and ordinarily isn't) unique."

CREATE TABLE enrollment (
    student_id integer,
    course_id  integer,
    PRIMARY KEY (student_id, course_id)
);

A column-by-column generator handles this badly in two opposite ways. It either treats student_id as a unique key and never lets a student enroll in two courses, or it generates both columns independently and produces duplicate (student_id, course_id) pairs that collide on insert. Both are wrong, and the second one only surfaces once enough rows exist to cause a collision, usually in CI rather than on a laptop with ten rows.

There is a second trap here: a primary key forces every participating column to NOT NULL. Adding a primary key "will force the column(s) to be marked NOT NULL," so a generator that emits an occasional NULL for a nullable-looking integer will fail against a column it didn't realize was mandatory.

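To list the duplicate tuples the key would reject, load the generated rows into a copy of the table without the primary key (the real table cannot hold them) and run: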

SELECT student_id, course_id, count(*)
FROM enrollment
GROUP BY student_id, course_id
HAVING count(*) > 1;

If that returns any rows, your generator is treating a tuple constraint as a set of column constraints.
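On the generating side, one way to respect a tuple constraint is to sample whole pairs rather than individual columns. The following is a minimal sketch, not any particular tool's method. It assumes the table starts empty, and the sizes (50 students, 20 courses, 200 rows) are made-up numbers.

```sql
-- Same table as above; IF NOT EXISTS keeps the sketch runnable on its own.
CREATE TABLE IF NOT EXISTS enrollment (
    student_id integer,
    course_id  integer,
    PRIMARY KEY (student_id, course_id)
);

-- Hypothetical sizes: 50 students x 20 courses = 1000 possible pairs.
-- Shuffling the cross product and keeping 200 rows yields distinct tuples,
-- while any one student can still appear in several courses.
INSERT INTO enrollment (student_id, course_id)
SELECT s.id, c.id
FROM generate_series(1, 50) AS s(id)
CROSS JOIN generate_series(1, 20) AS c(id)
ORDER BY random()
LIMIT 200;
```

Because every pair comes from the cross product, neither duplicates nor NULLs can appear, whatever the random order turns out to be.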

2. Partial unique indexes: uniqueness with a WHERE clause

This is the one I see missed most often, because it isn't a constraint at all. It's an index, and generators that introspect constraints never see it. A partial unique index enforces uniqueness "among the rows that satisfy the index predicate, without constraining those that do not."

CREATE UNIQUE INDEX one_active_subscription
    ON subscriptions (user_id)
    WHERE status = 'active';

That index says: a user may have many subscriptions, but only one active one. A generator that produces realistic-looking subscription histories, several rows per user with a mix of statuses, will happily hand two of them status = 'active' and hit a unique violation that exists only for the active subset. Nothing in the column types hints at it. Nothing in the foreign keys hints at it. The rule lives in a WHERE clause on an index.

Diagnostic, to find the predicate subset that would collide:

SELECT user_id, count(*)
FROM subscriptions
WHERE status = 'active'
GROUP BY user_id
HAVING count(*) > 1;
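A generator can avoid the collision by deciding the predicate column per user rather than per row. The sketch below assumes a two-column subscriptions table (the article only shows the index, so the column list is a guess) and uses 'cancelled' as a hypothetical second status.

```sql
-- Assumed shape of the table; only the index comes from the text above.
CREATE TABLE IF NOT EXISTS subscriptions (
    user_id integer NOT NULL,
    status  text    NOT NULL
);
CREATE UNIQUE INDEX IF NOT EXISTS one_active_subscription
    ON subscriptions (user_id)
    WHERE status = 'active';

-- Three rows per user; only the first row of each history is 'active',
-- so the partial unique index is satisfied by construction.
INSERT INTO subscriptions (user_id, status)
SELECT u.id,
       CASE WHEN n.seq = 1 THEN 'active' ELSE 'cancelled' END
FROM generate_series(1, 100) AS u(id)
CROSS JOIN generate_series(1, 3) AS n(seq);
```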

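Worth knowing: you cannot express this as a UNIQUE table constraint with ALTER TABLE ... ADD CONSTRAINT, because a UNIQUE constraint has no WHERE clause. Partial uniqueness is normally declared through CREATE UNIQUE INDEX ... WHERE, which is exactly why constraint-only introspection misses it. (An exclusion constraint can carry a predicate, but that is a different and much rarer construct.)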
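To see the rules that constraint introspection misses, the index catalog can be queried directly. This is a sketch of one possible probe: it lists every unique index that carries a predicate, along with the predicate text.

```sql
-- Unique indexes with a WHERE clause, read from the system catalog.
-- indpred is NULL for ordinary (non-partial) indexes.
SELECT x.indrelid::regclass               AS table_name,
       x.indexrelid::regclass             AS index_name,
       pg_get_expr(x.indpred, x.indrelid) AS predicate
FROM pg_index AS x
WHERE x.indisunique
  AND x.indpred IS NOT NULL;
```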

3. CHECK constraints: the rule that spans two columns

A CHECK constraint can reference more than one column in the same row, and that cross-column form is where generated data falls down. A per-column generator picks each value in isolation, so it has no way to satisfy a rule that relates two of them.

CREATE TABLE bookings (
    starts_at timestamptz NOT NULL,
    ends_at   timestamptz NOT NULL,
    CHECK (ends_at > starts_at)
);

Generate starts_at and ends_at independently from a plausible date range and roughly half your rows will have an end before the start. Every value is a valid timestamp. The row is still rejected.
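The usual fix is to generate one column and derive the other from it. A minimal sketch, assuming a 90-day window and durations between 15 and 255 minutes (both made-up values):

```sql
CREATE TABLE IF NOT EXISTS bookings (
    starts_at timestamptz NOT NULL,
    ends_at   timestamptz NOT NULL,
    CHECK (ends_at > starts_at)
);

-- Pick the start at random, then derive the end from that same start.
WITH starts AS (
    SELECT timestamptz '2026-01-01 00:00:00+00'
           + random() * interval '90 days' AS s
    FROM generate_series(1, 1000)
)
INSERT INTO bookings (starts_at, ends_at)
SELECT s,
       s + make_interval(mins => 15 + (random() * 240)::int)
FROM starts;
```

The end is always at least 15 minutes after the start, so the cross-column rule holds for every row instead of for roughly half of them.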

Two details that bite specifically during seeding:

- NULL passes the check. A CHECK is satisfied when its expression is true or null. The columns above are NOT NULL, so it can't bite in this example, but the moment a checked column is nullable, CHECK (ends_at > starts_at) passes on every row where ends_at is null. On a nullable schema, write the diagnostic as WHERE col IS NULL OR NOT (...) so those rows aren't silently skipped.
- NOT VALID constraints lie about coverage. A constraint added NOT VALID is enforced for new rows immediately but never checked against existing rows until you run VALIDATE CONSTRAINT. If you seed into a table that has a NOT VALID check, the seed is held to the rule even though the old data isn't.
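Both details are visible in the catalog. The query below is one way to list every table-level CHECK with its expression and whether it has been validated against existing rows; the column aliases are illustrative.

```sql
-- contype = 'c' selects CHECK constraints; conrelid <> 0 skips domain checks.
-- convalidated is false for constraints added NOT VALID and not yet validated.
SELECT conrelid::regclass        AS table_name,
       conname                   AS constraint_name,
       convalidated              AS validated_on_existing_rows,
       pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE contype = 'c'
  AND conrelid <> 0
ORDER BY conname;
```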

Run the constraint expression as a query and count the violations:

SELECT count(*) FROM bookings WHERE NOT (ends_at > starts_at);

4. JSONB: the column type that enforces almost nothing

A jsonb column guarantees one thing: the value is valid JSON. It does not enforce keys, required fields, or value types. The structure "is typically unenforced," in the documentation's words. The shape your application depends on lives entirely in application code, not in the column.

This is a problem for generators in both directions. A naive generator drops '{}' or a random string-keyed blob into the column. That is valid JSON, so the insert succeeds, and the first code path that reads payload->>'amount' gets null and breaks far away from the cause. A generator that knows the column is jsonb still has no schema to generate against, because Postgres never had one to give it.

You can pull some of the contract back into the database with a CHECK:

ALTER TABLE events
    ADD CONSTRAINT events_payload_shape
    CHECK (
        payload ? 'type'
        AND payload ? 'amount'
        AND jsonb_typeof(payload -> 'amount') = 'number'
    );

The explicit payload ? 'amount' test is doing real work. Drop it and a row with no amount key passes anyway, because jsonb_typeof(payload -> 'amount') returns SQL NULL on a missing key, and a CHECK is satisfied by NULL. It's the same trap as Section 3, hiding in a different operator.
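A generator that knows the intended shape can build it explicitly instead of emitting arbitrary JSON. The sketch below assumes an events table whose only required column is payload, and uses 'charge' and 'refund' as hypothetical event types.

```sql
-- Assumed minimal table; a real events table likely has more columns.
CREATE TABLE IF NOT EXISTS events (payload jsonb);

-- Build each payload with both required keys and a numeric amount.
INSERT INTO events (payload)
SELECT jsonb_build_object(
           'type',   (ARRAY['charge', 'refund'])[1 + floor(random() * 2)::int],
           'amount', round((random() * 500)::numeric, 2)
       )
FROM generate_series(1, 100);
```

Passing a numeric to jsonb_build_object stores a JSON number rather than a string, which is what the jsonb_typeof test in the CHECK looks for.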

Diagnostic, with absence treated as a violation rather than skipped over:

SELECT count(*) FROM events
WHERE payload IS NULL
   OR NOT (payload ? 'type')
   OR jsonb_typeof(payload -> 'amount') IS DISTINCT FROM 'number';

IS DISTINCT FROM is the key move: unlike = 'number', it returns true (a violation) when the left side is NULL, so a missing amount is counted instead of silently passing. If you have no such CHECK and no such query, your generated JSON is "valid" only in the sense that Postgres accepts it, not in the sense that your application will.
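The difference between the two comparisons is easiest to see on three literal payloads. This is a small illustration with made-up values, and it needs no table.

```sql
-- Compare "=" with IS DISTINCT FROM on a number, a string, and a missing key.
SELECT t.payload,
       jsonb_typeof(t.payload -> 'amount') = 'number'                AS equals_test,
       jsonb_typeof(t.payload -> 'amount') IS DISTINCT FROM 'number' AS flagged
FROM (VALUES
        ('{"type": "charge", "amount": 12.5}'::jsonb),
        ('{"type": "charge", "amount": "12.5"}'::jsonb),
        ('{"type": "charge"}'::jsonb)
     ) AS t(payload);
-- amount is a number:  equals_test = true,  flagged = false
-- amount is a string:  equals_test = false, flagged = true
-- amount is missing:   equals_test = NULL,  flagged = true
```

Only the third row separates the two forms: a WHERE NOT (equals_test) filter would skip it, while the flagged column counts it.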

5. Generated columns: the value you must not write

A generated column is computed from other columns, and the docs are blunt about it: "A generated column cannot be written to directly." Try to insert one anyway and Postgres rejects the statement with the error "cannot insert a non-DEFAULT value into column", followed by the column name.

CREATE TABLE line_items (
    quantity    integer NOT NULL,
    unit_price  numeric NOT NULL,
    total_price numeric GENERATED ALWAYS AS (quantity * unit_price) STORED
);

A generator that builds its column list by reading every column in the table will try to insert total_price and fail on the first row. A generator that simply skips unknown columns might omit a column that isn't generated. Knowing which columns are write-protected is the difference, and it is not visible from the column type. As far as the type system is concerned, total_price is just a numeric.
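In practice a writer has two safe options, shown here against the line_items table above with made-up values: leave the generated column out of the column list, or name it and pass the DEFAULT keyword.

```sql
CREATE TABLE IF NOT EXISTS line_items (
    quantity    integer NOT NULL,
    unit_price  numeric NOT NULL,
    total_price numeric GENERATED ALWAYS AS (quantity * unit_price) STORED
);

-- Option 1: omit the generated column entirely.
INSERT INTO line_items (quantity, unit_price) VALUES (3, 9.99);

-- Option 2: name it, but pass DEFAULT, the only value Postgres accepts for it.
INSERT INTO line_items (quantity, unit_price, total_price)
VALUES (2, 5.00, DEFAULT);

-- total_price is filled in by Postgres in both cases.
SELECT quantity, unit_price, total_price FROM line_items;
```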

There's a version trap here worth flagging, because it changed recently. STORED generated columns arrived in Postgres 12, and through Postgres 17 the keyword was required: omit it and you get a syntax error, so every generated column on those versions is stored. Postgres 18 made the keyword optional and virtual the default, so an unqualified GENERATED ALWAYS AS (...) is now a virtual column computed at read time. That bites after an upgrade: existing columns keep their kind, but DDL written from then on no longer has to say which one it means. Virtual columns can't be indexed yet (planned for a later release), so a column definition that used to be STORED and indexed becomes un-indexable the moment someone drops the keyword from the DDL. Always write STORED explicitly when you mean stored. The no-direct-write restriction applies to both kinds.

List the generated columns a writer must never populate:

SELECT column_name, is_generated, generation_expression
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'line_items'
  AND is_generated = 'ALWAYS';
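The same view can be turned around to produce the column list a writer is allowed to use. A sketch, assuming the line_items table lives in the public schema:

```sql
-- Columns that are NOT generated, in table order, as a ready-made column list.
SELECT string_agg(quote_ident(column_name::text), ', '
                  ORDER BY ordinal_position) AS writable_columns
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name   = 'line_items'
  AND is_generated = 'NEVER';
```

Identity columns are reported through is_identity rather than is_generated, so this filter does not exclude them. A generator that relies on it would need a separate check for columns declared GENERATED ALWAYS AS IDENTITY.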

6. Row-level security: "valid data" depends on who is writing

Row-level security is the feature that breaks the assumption underneath all the others, namely that whether a row is valid is a property of the row. Under RLS it is a property of the row and the role doing the write.

ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON accounts
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::int)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::int);

The WITH CHECK clause is applied to every INSERT and UPDATE: a row whose tenant_id doesn't match the current tenant is rejected even though every value in it is type-correct and constraint-clean. Two facts compound this for anyone generating data:

- The table owner bypasses RLS by default. Seed as the owner and every policy stays silent, so the data looks fine until the application connects as a normal role and the same rows turn out to be invisible or the same inserts get refused. Unless the table has FORCE ROW LEVEL SECURITY, your seed never exercised the policy.
- Superusers and BYPASSRLS roles skip policies entirely. Seeding scripts often connect as exactly these privileged roles, so generate through one and you've tested nothing about the rules that govern real traffic.
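One way to make a seed exercise the policy is to switch to an unprivileged role for the inserts. The sketch below is an assumption-heavy illustration: seed_app is a hypothetical role, the accounts table is reduced to the single tenant_id column, tenant 42 is a made-up value, and the setup half has to run as the table owner or an administrator.

```sql
-- Setup, run as the table owner. The policy is the one shown above.
CREATE TABLE IF NOT EXISTS accounts (tenant_id integer NOT NULL);
ALTER TABLE accounts ENABLE ROW LEVEL SECURITY;
DROP POLICY IF EXISTS tenant_isolation ON accounts;
CREATE POLICY tenant_isolation ON accounts
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::int)
    WITH CHECK (tenant_id = current_setting('app.tenant_id')::int);
CREATE ROLE seed_app NOLOGIN;            -- hypothetical: no SUPERUSER, no BYPASSRLS
GRANT SELECT, INSERT ON accounts TO seed_app;

-- Seed as the application-like role so the policy is actually evaluated.
BEGIN;
SET LOCAL ROLE seed_app;
SELECT set_config('app.tenant_id', '42', true);  -- true = transaction-local
INSERT INTO accounts (tenant_id) VALUES (42);    -- satisfies WITH CHECK
COMMIT;
```

Inside that transaction, an insert with any other tenant_id is refused with a row-level security violation, which is the behavior the application will see in production.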

Diagnostic, to see which tables have policies your seed role might be quietly bypassing:

SELECT schemaname, tablename, policyname, cmd
FROM pg_policies
ORDER BY tablename;

If those tables matter and your generator connects as the owner, your "valid" rows were never measured against the rules that decide validity in production.
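Whether the current connection is actually subject to those policies can be checked directly. The sketch below uses the built-in row_security_active function; run it while connected as the role your seed uses.

```sql
-- Tables with RLS enabled, and whether policies apply to the current role.
-- policies_apply_to_me = false means this connection bypasses the policy.
SELECT c.oid::regclass            AS table_name,
       c.relforcerowsecurity      AS forced_for_owner,
       row_security_active(c.oid) AS policies_apply_to_me
FROM pg_class AS c
WHERE c.relrowsecurity
  AND c.relkind IN ('r', 'p');
```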

Where the tools actually sit

None of this means generated test data is a bad idea. It means the question to ask a generator is not "does it produce realistic values" but "how much of the schema does it treat as input." Roughly three tiers exist today:

- Free and DIY. Faker, ORM seeders, and hand-written scripts generate values per column. Relationships, table-level constraints, and the features above stay your job, in your code, kept in sync by hand.
- Schema-aware generators. A newer middle tier reads the schema and treats its structure as a first-class input. Neosync and Seedfast are two of several tools that take this approach in different ways. The honest trade-off is that "schema-aware" is a spectrum. Foreign keys and NOT NULL are widely handled, while partial unique indexes, cross-column CHECKs, and RLS are exactly where coverage varies between tools, so it's worth testing each against your own constraints rather than trusting the label.
- Enterprise TDM platforms. Anonymization and masking suites that transform production data. They cover a lot, at the cost of needing production access and a setup measured in weeks.

The useful question isn't which price tier a tool sits in; it's how far past column types it actually looks. Run your real constraints through any candidate before you trust it, whatever tier it claims.

A quick decision table

| If your schema relies on... | Before trusting generated rows, check... |
| --- | --- |
| Composite primary keys | duplicate tuples, and `NULL` in any PK column |
| Partial unique indexes | uniqueness only within the predicate subset |
| Cross-column CHECKs | the expression as a `WHERE NOT (...)` query, adding `col IS NULL OR` for nullable columns |
| JSONB shape | required keys and value types via a CHECK or a probe query |
| `GENERATED ALWAYS` columns | the writer skips them; `STORED` is explicit on PG 18+ |
| Row-level security | the seed role isn't the owner or `BYPASSRLS` |

When none of this matters

Plenty of test data doesn't need to clear this bar. A unit test that touches one flat table is better served by a three-line fixture than by anything that introspects a schema. Not every column needs realistic values; sometimes 'x' and 1 are the honest choice because the test doesn't care. And there's a class of rules no generator can infer, the domain invariants that live only in your head or your application, like "a refund row must point at a captured payment." Those you encode yourself, or you assert in the test.

Generators aren't the problem here. The trouble is that "the insert succeeded" and "the data is valid" are two different claims, and the gap between them lives in exactly the schema features that never show up in a column type. Run the six queries above against your own generated data before you trust the green summary.

DEV Community: Mikhail Shytsko
