HelperX

Posted on Jul 2

Migrating a SQLite Schema in Production Without Locking Writers

#sqlite #database #node #backend

I've written before about why we run SQLite in production for HelperX, and about multi-tenant isolation and backup strategy. There's one topic I kept deferring because it's genuinely hard and I didn't want to hand out advice I wasn't sure of: schema migrations.

SQLite is famous for a painful migration story. ALTER TABLE is deliberately limited — you can rename a table, add a column, or rename a column, but you can't drop a column, change a column's type, or add a non-constant default without rebuilding the table. The classic answer ("create a new table, copy the data, swap") is easy to get wrong under concurrent writes.

This article is the migration approach we settled on after breaking things a few times. It works under live write load, never loses data, and degrades gracefully if something fails midway.

Why SQLite migrations are uniquely hard

In Postgres, a migration like "drop column X" is a one-liner and (with appropriate locking) runs while the app keeps writing. SQLite can't do that. The supported ALTER TABLE operations are intentionally minimal — the SQLite maintainers chose simplicity over migration ergonomics.

The only way to make structural changes SQLite doesn't support natively is the table rebuild:

Create a new table with the desired schema.
Copy rows from the old table to the new.
Atomically rename new → old (within a transaction).

This works, but two things make it dangerous in production:

Locking. The copy holds a write lock for its duration. On a multi-GB table under steady writes, that's a multi-second stall where every writer blocks.
Failure recovery. If the rebuild fails midway (disk full, process killed), you need a clean rollback path that doesn't leave half-built tables around.

Our strategy addresses both.

The principle: small, additive, backward-compatible

Before getting to the rebuild technique, the more important lesson: most migrations don't need to be destructive, and the ones that don't are nearly free.

We categorize every schema change:

Change type	Example	Migration cost
Additive (new table, new column)	Add `last_seen_at` column	Cheap — native `ALTER TABLE ADD COLUMN`
Additive index	New index for a query	Cheap — `CREATE INDEX`, runs in background
Column rename	`user_id` → `author_id`	Medium — needs dual-write window
Type change / column drop	Change int to text, drop a column	Expensive — full rebuild

Our rule: additive changes ship immediately; destructive changes are rare and planned. The vast majority of real-world schema evolution is additive — new features add tables and columns; they rarely restructure existing ones. By keeping destructive changes rare, we keep the hard technique rare too.

When we do need a destructive change, here's how.

The dual-write migration pattern

For changes that require a rebuild (drop a column, change a type, restructure), we avoid doing it in one big locked transaction. Instead we use a phased approach that never blocks writers for more than milliseconds.

Phase 1: Add the new shape alongside the old

We don't touch the existing table. We create the new table with the target schema:

-- Old table: posts (has a column we want to drop)
-- New table: posts_v2 (without that column)
CREATE TABLE posts_v2 (
  post_id    TEXT PRIMARY KEY,
  slot_id    TEXT NOT NULL,
  content    TEXT NOT NULL
  -- note: legacy_column is gone
);

This is additive (a new table) — it doesn't lock anything. Writers continue to the old table, undisturbed.

Phase 2: Dual-write

We update the application to write to both tables:

async function insertPost(post) {
  await db.run('INSERT INTO posts (...) VALUES (...)', post);
  await db.run('INSERT INTO posts_v2 (...) VALUES (...)', post); // also write new shape
}

From this moment, every new row exists in both tables. Reads still come from the old table. We run this dual-write for long enough that all active data has propagated to the new table — typically a few days for our workload.

Phase 3: Backfill the historical data

New rows are now in both tables, but old rows (written before Phase 2) exist only in the old table. We backfill them in small, lock-friendly batches:

async function backfillPosts(batchSize = 1000) {
  let offset = 0;
  while (true) {
    // Copy a batch of rows that aren't yet in posts_v2
    const rows = await db.all(`
      SELECT * FROM posts
      WHERE post_id NOT IN (SELECT post_id FROM posts_v2)
      LIMIT ?`, batchSize);

    if (rows.length === 0) break;

    await db.run('BEGIN');
    try {
      for (const row of rows) {
        await db.run(`INSERT INTO posts_v2 VALUES (?, ?, ?)`,
          row.post_id, row.slot_id, row.content);
      }
      await db.run('COMMIT');
    } catch (e) {
      await db.run('ROLLBACK');
      throw e;
    }

    offset += batchSize;
    await sleep(100); // yield to writers between batches
  }
}

Two details make this safe under live load:

Small batches. Each transaction copies only 1,000 rows, holding the write lock for milliseconds. Writers never wait long.
Sleep between batches. A 100ms pause between transactions lets queued writers drain. The backfill takes longer but never starves the application.

Phase 4: Switch reads

Once the backfill completes (the new table has every row the old one does), we switch reads to the new table:

async function getPost(postId) {
  // Now reading from posts_v2
  return db.get('SELECT * FROM posts_v2 WHERE post_id = ?', postId);
}

Dual-writes continue, so the two tables stay in sync. If we discover a problem with reads against the new table, we can switch reads back to the old table in seconds — both are still current.

Phase 5: Stop dual-writing, drop the old table

After running on the new table with confidence (we wait a few days), we stop the dual-write and remove the old table:

// Application now writes only to posts_v2
async function insertPost(post) {
  await db.run('INSERT INTO posts_v2 (...) VALUES (...)', post);
}

DROP TABLE posts;

Finally, we rename posts_v2 to posts so the application code can drop the _v2 suffix — though we often leave the suffix for a release cycle to keep the rename itself trivially reversible.

Why this beats a locked rebuild

A naive rebuild does all the work in one locked transaction: copy every row, swap, commit. On a large table, that lock can hold for seconds — an eternity for a live application. If it fails midway, you've held the lock for nothing and you start over.

The dual-write pattern spreads the work across time:

Phase 1–2 (add table, start dual-write): milliseconds of locking, near-instant.
Phase 3 (backfill): runs over hours/days in tiny batches, each holding the lock for milliseconds. Application writers are never blocked.
Phase 4–5 (switch reads, drop old): near-instant.

At no point is the application unable to write. At no point is there a multi-second lock. And at every phase, we have a rollback path: if reads against the new table misbehave, flip back to the old one — it's still being dual-written.

The rollback guarantee

This is the property that makes the pattern worth the complexity: every phase is independently reversible until the very last step.

After Phase 2 (dual-write started): roll back by stopping the dual-write and dropping posts_v2. Old table untouched.
After Phase 3 (backfilled): same — old table is still the source of truth.
After Phase 4 (reads switched): roll back by switching reads back to old table. Both still in sync.
After Phase 5 (old table dropped): not reversible. This is the commit point.

By deferring the irreversible step to the very end, we can run the migration for days with the ability to abort at any time. Only when we've confirmed the new table behaves correctly under real load do we take the irreversible step. That's the safety property that a one-shot locked rebuild fundamentally can't offer.

Migration tooling: keep it boring

We don't use a fancy migration framework. We use numbered SQL files and a tiny runner:

async function runMigrations(db) {
  await db.run(`CREATE TABLE IF NOT EXISTS schema_migrations (
    id INTEGER PRIMARY KEY,
    applied_at INTEGER NOT NULL
  )`);

  const applied = new Set(
    (await db.all('SELECT id FROM schema_migrations')).map(r => r.id)
  );

  const files = fs.readdirSync('./migrations')
    .filter(f => f.endsWith('.sql'))
    .sort();

  for (const file of files) {
    const id = parseInt(file.split('_')[0]);
    if (applied.has(id)) continue;

    const sql = fs.readFileSync(`./migrations/${file}`, 'utf8');
    await db.run('BEGIN');
    try {
      await db.exec(sql);
      await db.run('INSERT INTO schema_migrations (id, applied_at) VALUES (?, ?)',
        id, Date.now());
      await db.run('COMMIT');
    } catch (e) {
      await db.run('ROLLBACK');
      throw new Error(`Migration ${file} failed: ${e.message}`);
    }
  }
}

Each migration is a numbered SQL file (014_add_posts_v2.sql, 015_backfill_posts_v2.sql, etc.). The runner applies them in order, recording each in schema_migrations. Failed migrations roll back and halt startup — we never run with a partially-applied schema.

The dual-write pattern maps onto this naturally: Phase 1–2 is one migration (create table + the application update ships separately), Phase 3 backfill is a separate background job (not a migration — it runs continuously), Phase 4–5 is a later migration once we're confident.

Edge cases worth handling

Concurrent migrations on multi-process setups. SQLite's file lock prevents two processes from migrating simultaneously — the second waits. But we also take an application-level advisory lock at startup to serialize migrations explicitly, so the behavior is predictable rather than relying on SQLite's lock timing.

A migration that adds a NOT NULL column with no default. SQLite can't do this on a populated table in one step. We add the column nullable, backfill values, then (in a later migration) enforce non-null at the application layer — because SQLite can't easily add a CHECK constraint to an existing column either. The application enforces what SQLite can't.

The backfill catching up to live writes. If the backfill is slow and live writes are fast, the backfill might never finish — it copies old rows while new ones arrive. We handle this by backfilling in creation-order (oldest first), so the "not yet copied" set shrinks monotonically and we eventually catch up. Live writes during backfill go to both tables via the dual-write, so they're never missed.

Long-running transactions during backfill. A backfill transaction that runs too long can starve writers. The 1,000-row batch + 100ms sleep keeps each transaction short. We also abort any backfill batch that takes more than 2 seconds (unlikely, but defensive).

What we learned

1. Prefer additive changes, always. Most schema evolution is additive and nearly free in SQLite. Reserve the rebuild technique for genuine structural changes, which should be rare.

2. Dual-write beats locked rebuild. Spreading the migration across phases with tiny per-batch locks keeps the application responsive throughout. The total migration takes longer, but no single moment is disruptive.

3. Defer the irreversible step. Make every phase reversible until the very end. Run for days with the ability to abort. Only commit (drop the old table) when confident.

4. Backfill in small, yielding batches. Large copy transactions block writers. Small batches with sleeps between them never starve the application.

5. Boring tooling wins. Numbered SQL files and a 30-line runner are easier to reason about than a framework. Migrations should be transparent and auditable, not magic.

6. Let the application enforce what SQLite can't. NOT NULL on existing columns, complex CHECK constraints, cross-table invariants — enforce these in application code rather than forcing a rebuild to add a constraint SQLite can't add in place.

SQLite's migration limitations are real, but they're not prohibitive. The constraint forces a discipline — additive-first, dual-write for the rest, reversible phases, small batches — that happens to produce more reliable migrations than the "one big ALTER" approach that fancier databases enable. The limitations make you safer, if you respect them.

HelperX runs on SQLite in production with additive-first, dual-write migrations that never block writers — because reliable data layer is the foundation every other feature stands on. Free 30-day trial.

DEV Community