Forrest Miller

Posted on Jun 1 • Originally published at bingwow.com

1,071 of our 1,190 moderation-queue cards were test pollution. Here's the 5-layer fix.

#postgres #testing #webdev #architecture

The bug we shipped twice

Our test fixtures leaked into production. Twice.

The first leak was visible. 117 QA cards reached /cards/education/science because their titles looked enough like real cards to slip past the title-pattern heuristic our public surfaces used. We cleaned them up in May.

The second leak was invisible until I audited the moderation queue last week. 1,071 of 1,190 cards waiting for human review were automation pollution — Playwright fixtures, a QA 65678 test account, timestamped runs named PW Test, Security Test, Mobile PW, plus high-volume anon seeds like Road Trip Bingo (×217) and Test Merged Create (×128). They had been queued up for the AI moderator as if a human had submitted them.

Neither leak was a bug in any single code path. The bug was that "is this a test row?" was a guess every consumer made on its own. So I wrote down the rule once and enforced it five ways.

This is the playbook, with the actual Postgres + TypeScript that runs in BingWow.

Why "is this a test row?" is hard

Test fixtures look exactly like real records. They use the same INSERT path. They carry the same shape. The only difference is intent: a human typed one, a Playwright spec typed the other.

Pre-2026, our public surfaces filtered tests with title heuristics — strings like QA test card or [playwright] were dropped at query time. That worked until a Shiplight test forgot to include the magic substring. Then the row was indistinguishable.

The fix Stripe shipped is the canonical reference: every record carries a livemode boolean set at INSERT time by whichever code path created it. Public surfaces filter on it; analytics segment by it; cleanup automates on it.

We needed the same shape, expanded to an enum-of-strings because we have more origins than test/prod.

Layer 1: the origin column (NOT NULL + CHECK)

ALTER TABLE card_templates
  ADD COLUMN origin text NOT NULL DEFAULT 'legacy_unknown';

ALTER TABLE card_templates
  ADD CONSTRAINT card_templates_origin_check
  CHECK (origin IN ('ai', 'user', 'admin', 'test'));

Four values. ai is the cron pipeline that invents brand-new cards from trending topics. user is everything a human submits at our create page. admin is hand-curated. test is every automation row.

Why text + CHECK instead of a native ENUM: ALTER TYPE … ADD VALUE acquires ACCESS EXCLUSIVE on every table using the type, which blocks writes during the migration. text + CHECK uses SHARE UPDATE EXCLUSIVE. Concurrent writes still work, the validation is the same.

After backfill, the column is the discriminator the codebase had been faking. creator_id IS NULL previously conflated AI cards with anonymous human cards (anonymous users have null creator_id too — that conflation broke moderation for months before we caught it).

Layer 2: the Postgres CHECK that makes the bad state impossible

ALTER TABLE card_templates
  ADD CONSTRAINT card_templates_no_published_test_origin
  CHECK (NOT (origin = 'test' AND status = 'published'));

This is the load-bearing line.

Any INSERT or UPDATE that would set origin='test' AND status='published' returns 23514 check_violation. Postgres refuses the write. The application code doesn't get a chance to do the wrong thing.

Application-level guards are necessary but not sufficient. A direct service-role write from a test fixture bypasses every TypeScript type-check. A future contributor who copy-pastes a working INSERT block from another file inherits whatever assumptions that file made. The database constraint is the only thing every write path passes through.

Layer 3: `isTestTraffic()` at every INSERT site

export function isTestTraffic(
  request: Request,
  body?: { is_automated?: boolean },
  userId?: string | null,
): boolean {
  if (body?.is_automated === true) return true;
  if (isInternalRequest(request, userId)) return true;

  // THE rule. Real user iff bingwow_anon_id cookie present.
  const cookie = request.headers.get('cookie') ?? '';
  return !/(^|;\s*)bingwow_anon_id=/.test(cookie);
}

Three card-creation routes call this on every POST: card creation, fork-and-start, and clone. The origin is computed at the row's birth — isTestTraffic(request, body, userId) ? 'test' : 'user' — and written into the same INSERT.

The cookie rule is the strongest signal. Every real visitor's first GET sets bingwow_anon_id. Playwright contexts launch without it. Headless Chrome and CDP-stealth fingerprints stay caught even when scripts try to set is_automated: false.

For forks of forks of forks, a small helper propagates the tag:

function deriveForkOrigin(parentOrigin: string): string {
  return parentOrigin === 'test' ? 'test' : 'user';
}

So an organic-looking fork tree rooted in a test card stays origin='test' all the way down. When the cleanup cron reaps the root, the whole tree goes with it.

Layer 4: the daily cleanup cron

// app/api/cron/cleanup-test-cards/route.ts
const { error } = await supabase
  .from('card_templates')
  .delete()
  .eq('origin', 'test')
  .lt('created_at', sevenDaysAgo);

It runs at 06:00 UTC every day. It only ever touches origin='test' rows older than seven days. A user-submitted card with origin='user' is never even considered, no matter how thin or how recently submitted, because a user's saved work is sacred.

The CHECK from Layer 2 means a test row can never have status='published', so the cleanup never deletes a publicly-indexed page. We sidestep the entire class of "the cleanup cron took down a popular card" incident.

Layer 5: the CI lint that catches the next mistake

// __tests__/test-fixtures-marked.test.ts
const FIXTURE_DIRS = ['shiplight-tests', 'e2e'];

it('every card_templates POST in test fixtures includes origin: test', () => {
  for (const file of allTestFiles(FIXTURE_DIRS)) {
    const text = readFileSync(file, 'utf-8');
    if (text.includes('card_templates') && /POST|insert\(/.test(text)) {
      expect(text, file).toMatch(/origin:\s*['"]test['"]/);
    }
  }
});

The runtime layers catch a malicious or careless write. The lint catches the next test author who forgets the rule before their PR even gets reviewed.

This was the missing piece. Test files bypass /api/templates/create — they POST directly with a service-role key, so the TypeScript types don't cover them. The lint is what closes that hole.

What the rule looks like once you write it down

A row is a test row iff it was created by automation.
Test rows MAY exist in the database.
Test rows MUST NOT be public.
Test rows ARE deleted after 7 days.

Five layers because each layer enforces one of those clauses in a place the others can't reach:

Clause	Layer
Test rows MAY exist	`card_templates.origin` column (NOT NULL + CHECK)
Test rows MUST NOT be public	`card_templates_no_published_test_origin` CHECK
Created by automation = tagged	`isTestTraffic(request, body, userId)` at every INSERT
The tag propagates	`deriveForkOrigin(parent.origin)` on every fork
New code obeys the rule	`__tests__/test-fixtures-marked.test.ts` CI lint

Drop any one of those and the leak comes back through whatever hole that layer was covering. The CHECK without isTestTraffic means tests are unrestricted because nothing tags them. isTestTraffic without the CHECK means a single buggy UPDATE statement can flip a tagged row to status='published'. The cleanup cron without the propagation rule reaps the parent but orphans the forks.

The lint is the layer that ships next month, because the others stop existing failures and the lint stops the failure that hasn't been written yet.

What changed for users

Nothing visible. The product on the play surface looks identical.

What changed is that I stopped writing post-mortems about test data on production. The 2026-05-07 leak was the second one. After Layer 2 shipped, there were zero.

A structural defense is the only kind that scales. Be-more-careful does not scale; documentation does not scale; code review does not scale to every junior contributor's first PR. A Postgres CHECK scales to every write your application will ever make, including the ones written by the contributor who hasn't been hired yet.

If you have a _test table or a dev_mode boolean nobody is sure does anything, you have the same gap we did. The 2-line Stripe-style CHECK is the fastest cleanup you will ever ship.

Architecture details: our research page has the longer write-ups. The product is a free multiplayer bingo game; the source pattern above runs against every card that lands in our database.

DEV Community

1,071 of our 1,190 moderation-queue cards were test pollution. Here's the 5-layer fix.

The bug we shipped twice

Why "is this a test row?" is hard

Layer 1: the origin column (NOT NULL + CHECK)

Layer 2: the Postgres CHECK that makes the bad state impossible

Layer 3: `isTestTraffic()` at every INSERT site

Layer 4: the daily cleanup cron

Layer 5: the CI lint that catches the next mistake

What the rule looks like once you write it down

What changed for users

Top comments (0)

The bug we shipped twice

Why "is this a test row?" is hard

Layer 1: the origin column (NOT NULL + CHECK)

Layer 2: the Postgres CHECK that makes the bad state impossible

Layer 3: isTestTraffic() at every INSERT site

Layer 4: the daily cleanup cron

Layer 5: the CI lint that catches the next mistake

What the rule looks like once you write it down

What changed for users

Layer 3: `isTestTraffic()` at every INSERT site