I built a 127-table PostgreSQL database over three months. For the first week, every schema change was tracked properly. Numbered SQL files, a migration table with checksums, clean and reproducible. Then development velocity picked up, I started prioritizing feature delivery over process, and the tracking fell behind.
By the time I audited the state, I had 19 tracked migrations from week 1 and over 100 untracked schema changes after that. No rollback capability. No record of what was applied when. No way to reproduce the database from scratch.
This post is about what that cost me, how I recovered, and why migration discipline matters more when you're moving fast, not less.
How the Tracking Fell Behind
Week 1 was clean. Custom migration system, numbered SQL files, SHA-256 checksums:
migrations/
000_schema_migrations.sql
001_init.sql
002_init-finance.sql
003_init-memory.sql
...
018_add-gocardless-columns.sql
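The week-1 system was nothing fancy. A minimal sketch of the checksum idea, assuming a `schema_migrations` table that maps filenames to recorded checksums (the function names here are illustrative, not my actual code):

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """SHA-256 of a migration file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def pending_migrations(migrations_dir: Path, applied: dict[str, str]) -> list[Path]:
    """Return numbered .sql files not yet applied, in order.

    `applied` maps filename -> recorded checksum, as stored in the
    migration table. If an already-applied file was edited after the
    fact, we fail loudly -- that's the tampering the checksum catches.
    """
    pending = []
    for path in sorted(migrations_dir.glob("*.sql")):
        if path.name in applied:
            if file_checksum(path) != applied[path.name]:
                raise RuntimeError(f"{path.name} changed after being applied")
        else:
            pending.append(path)
    return pending
```

The point isn't the code, it's that this check only protects you if every change actually goes through it.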
Then the pace changed. I was running multiple development sessions in parallel, each one producing features that needed schema changes. The quick path was deploying SQL directly to the running database:
ssh docker-host "docker exec -i postgres psql -U user -d db" <<< "
ALTER TABLE email_messages ADD COLUMN knowledge_extracted_at TIMESTAMPTZ;
CREATE INDEX idx_emails_knowledge ON email_messages (knowledge_extracted_at);
"
The SQL was always written properly. The change itself was correct. What I skipped was registering it in the migration system. Every time, the reasoning was the same: the feature is urgent, the schema change is simple, I'll add the migration file when things slow down.
Things never slowed down. Over 11 weeks, the live database grew to 127 tables while the migration history still described 60.
The situation got more complicated with parallel development. Two sessions both needed to add columns to the same table. Both changes deployed fine because they were different columns. But neither was tracked. A week later, a new session read the schema files from git, assumed those columns didn't exist, and wrote code that conflicted with the live state.
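That kind of drift is mechanically detectable, which makes it more embarrassing that I wasn't checking. A hedged sketch of the comparison I should have been running, feeding in `(table, column)` pairs parsed from the schema files in git on one side and from the live database's `information_schema.columns` on the other (the function name is mine, for illustration):

```python
def schema_drift(
    tracked: set[tuple[str, str]],
    live: set[tuple[str, str]],
) -> dict[str, set[tuple[str, str]]]:
    """Diff (table, column) pairs between git-tracked schema and the live DB.

    Returns the columns that exist only on one side -- exactly the drift
    that bit me when parallel sessions deployed untracked ALTERs.
    """
    return {
        "only_live": live - tracked,      # deployed but never tracked
        "only_tracked": tracked - live,   # tracked but never deployed (or since dropped)
    }
```

Run on a schedule, even this crude version would have flagged the untracked columns within a day instead of a week.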
What It Cost Me
The most expensive incident: I deployed a schema change while a classification workflow was processing a batch of emails. The workflow had already classified a bunch of entries and marked them as "in progress," but hadn't written the results yet. When the schema changed, the final INSERT failed. Those emails were now stuck. The pipeline wouldn't reprocess them because they were flagged as in-progress, but the results never landed. Manual recovery took a couple of hours.
The fix was simple once I found the stuck records. But the root cause wasn't the schema change itself. It was that I had no process for coordinating database changes with running workflows. No migration file meant no review step, no "is anything depending on this column right now" check before executing.
The ongoing friction was worse than any single incident. New development sessions would reference schema files from git that were weeks out of date. Code would target columns that existed in production but not in the tracked schema, or vice versa. When something looked wrong in the data, there was no migration history to tell me what changed and when.
The Recovery: Baseline and Move Forward
I couldn't retroactively track 100+ changes. The pragmatic choice was to accept that weeks 2-12 were untracked, snapshot the current state, and build proper tracking from that point forward.
I evaluated migration tools against my actual workflow. I write raw SQL with asyncpg, no ORM, so Alembic (which generates migrations from SQLAlchemy models) wasn't a fit. Flyway needs a JVM runtime. I went with dbmate: a single binary, plain SQL migration files with up/down sections, built-in state tracking, and rollback support. It matched how I already write database code without adding framework dependencies.
The baseline process:
1. Snapshot the live schema. pg_dump --schema-only exported the full database structure: 9,756 lines covering all 127 tables, indexes, constraints, and functions.
2. Convert it to a dbmate migration. Added dbmate's -- migrate:up and -- migrate:down markers so the tool recognizes the file format.
3. Register it as already applied. Created dbmate's tracking table and inserted the baseline version, so dbmate knows "this is the starting point, don't try to run it."
4. Preserve the old history. Renamed the original migration table to schema_migrations_legacy rather than dropping it, in case I ever need to reference the week 1 records.
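The marker-adding step is trivial but worth showing, because getting the format wrong means dbmate won't recognize the file at all. A sketch, assuming the dump is already in hand as a string:

```python
def baseline_migration(schema_dump: str) -> str:
    """Wrap a pg_dump --schema-only dump in dbmate's marker format.

    The down section is deliberately empty: you don't roll back a
    baseline, you restore from backup.
    """
    return (
        "-- migrate:up\n"
        + schema_dump.rstrip("\n")
        + "\n\n-- migrate:down\n"
    )
```

dbmate records applied migrations by the version prefix of the filename in its schema_migrations table, so "registering as applied" is a single INSERT of the baseline file's version string.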
After this, dbmate status showed 1 applied, 0 pending. Clean slate. The 9,756-line baseline is the single source of truth for the database structure.
The New Process
Every schema change now follows three rules:
The migration file gets created before the change is applied. Not after. The act of writing the migration IS the review step. It forces me to think about what I'm changing, whether anything depends on it, and how to reverse it.
Every migration has a rollback section. Even if the rollback is just "drop the column I added." This is what the -- migrate:up and -- migrate:down format gives you:
-- migrate:up
ALTER TABLE scheduler_tasks
ADD COLUMN IF NOT EXISTS retry_count INT DEFAULT 0;
ALTER TABLE scheduler_tasks
ADD COLUMN IF NOT EXISTS max_retries INT DEFAULT 3;
-- migrate:down
ALTER TABLE scheduler_tasks
DROP COLUMN IF EXISTS retry_count;
ALTER TABLE scheduler_tasks
DROP COLUMN IF EXISTS max_retries;
The IF NOT EXISTS / IF EXISTS pattern makes migrations idempotent. If something interrupts the process halfway, I can re-run safely.
The migration commits with the feature code. Same commit, same PR. git log always links a schema change to the feature that needed it. No more guessing when a column was added or why.
What I'd Tell Someone Starting a Similar Project
Track your schema changes from day one. Even if it's just numbered SQL files in a directory. The tooling doesn't matter as much as the habit. A migration system you actually use beats a sophisticated one you skip when you're in a hurry.
If you're already behind, don't try to retroactively reconstruct history. Snapshot your current state with pg_dump, declare it the baseline, and track forward. Accepting that some history is lost is better than pretending it's tracked when it isn't.
If you use AI coding tools for development, add migration creation to your workflow instructions. These tools generate schema changes fast and won't create migration files unless you explicitly ask them to. The more automated your development, the more important it is that the tracking step is built into the process rather than treated as a separate manual step.
And coordinate schema changes with your running services. A database migration that's technically correct can still cause problems if it modifies something that active workflows depend on. The migration file is the natural place to document those dependencies, even if it's just a comment at the top: "this column is used by the email classification pipeline, deploy the updated pipeline code first."
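To make that comment machine-checkable rather than just documentation, here's the convention I've settled on, sketched in Python: a "-- depends:" header at the top of the migration that a pre-deploy script can extract. The convention and the function are mine, not a dbmate feature.

```python
def migration_dependencies(migration_sql: str) -> list[str]:
    """Extract '-- depends:' comment lines from the top of a migration.

    Stops scanning at the first real SQL statement, so only the header
    block counts -- a dependency note buried mid-file is a smell anyway.
    """
    deps = []
    for line in migration_sql.splitlines():
        line = line.strip()
        if line.startswith("-- depends:"):
            deps.append(line.removeprefix("-- depends:").strip())
        elif line and not line.startswith("--"):
            break  # first real SQL statement ends the header
    return deps
```

A deploy script that prints these before applying the migration turns "did I remember to check?" into a prompt I can't skip.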
This is Part 4 of "One Developer, 22 Containers." The series covers building an AI office management system on consumer hardware, the choices, the trade-offs, and the things that broke along the way.
Find me on GitHub.