Víctor

Posted on • Originally published at micelclaw.com

137 migrations and counting: building a personal OS schema from scratch

I started this project on February 18th, 2026. The idea was embarrassingly simple: I wanted a personal assistant that could look up my contacts fast and without burning through tokens. Notes, emails, a diary — the basics. A small Fastify server, a PostgreSQL database, maybe ten tables.

Somewhere along the way, I ended up with 137 SQL migrations.

This is the story of how a "quick personal cloud" turned into what we now call a sovereign digital operating system — and what the schema evolution looks like when you're building the plane while flying it.

Day 1: the schema that fit on a napkin

Here's the first migration, 0000_initial_schema.sql. Thirteen tables. I remember thinking "this covers everything I'll ever need":

CREATE EXTENSION IF NOT EXISTS "pgcrypto";
CREATE EXTENSION IF NOT EXISTS "vector";

CREATE TABLE IF NOT EXISTS notes (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title           TEXT,
    content         TEXT NOT NULL,
    content_format  VARCHAR(20) DEFAULT 'markdown',
    source          VARCHAR(50) NOT NULL DEFAULT 'local',
    tags            TEXT[],
    pinned          BOOLEAN DEFAULT false,
    archived        BOOLEAN DEFAULT false,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    deleted_at      TIMESTAMPTZ
);

Notes, events, contacts, emails, files, diary entries, a CRM table (ambition!), Home Assistant events, agent conversations, entity links, OAuth tokens, a license cache, and an embeddings table with pgvector.

Look at that embeddings table:

CREATE TABLE IF NOT EXISTS embeddings (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    domain          VARCHAR(30) NOT NULL,
    record_id       UUID NOT NULL,
    content_text    TEXT NOT NULL,
    embedding       vector(768) NOT NULL,
    model_used      VARCHAR(100),
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(domain, record_id)
);

One table for all embeddings. Domain column to tell apart notes from emails. Simple. Clean.
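
That single table is enough for similarity search across every domain. A sketch of the kind of nearest-neighbour query it supports, using pgvector's `<=>` cosine-distance operator (`$1` stands in for a 768-dimension query vector; the exact query the project runs may differ):

```sql
-- Find the ten records closest to a query embedding, across all domains.
SELECT domain, record_id, content_text
FROM embeddings
ORDER BY embedding <=> $1   -- pgvector cosine distance
LIMIT 10;
```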

That simplicity lasted about four days.

The decisions that shaped everything

Three early decisions ended up defining the entire architecture. I didn't fully appreciate their impact at the time.

Everything in public schema

We put all tables in PostgreSQL's default public schema instead of creating separate schemas per domain (a notes schema, an emails schema, etc.). The reasoning was practical: every record in the system — notes, events, contacts, files — needs to go through the same intelligence pipeline. Embeddings. Entity extraction. Heat tracking. Knowledge graph edges. Change log entries.

If each domain lived in its own schema, every cross-domain feature would need cross-schema joins and duplicated hooks. With everything in public, a single set of CRUD hooks can process any record type through the same pipeline. A note and an email get embedded, heat-tracked, entity-extracted, and changelog'd by the exact same code path.

This decision paid off enormously when we built the knowledge graph. An entity link between a contact and a calendar event is just a row in entity_links — no cross-schema ceremony.

Soft delete everywhere

Every single domain table has deleted_at TIMESTAMPTZ. No exceptions. From day one.

This wasn't wisdom — it was paranoia. But it turned out to be critical for the sync engine. When you sync with Gmail or Google Calendar, you need to know what was deleted locally so you can propagate that deletion. A hard delete means lost information. deleted_at means you can diff between "this record was deleted" and "this record never existed."

It also enabled restore functionality for free. `PATCH /notes/:id/restore` is just `SET deleted_at = NULL`.
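
The restore path really is that small. A sketch for notes (the `deleted_at IS NOT NULL` guard and the `updated_at` bump are assumptions, not necessarily what the project does):

```sql
-- Undelete a soft-deleted note; a no-op if it was never deleted.
UPDATE notes
SET    deleted_at = NULL,
       updated_at = now()
WHERE  id = $1
  AND  deleted_at IS NOT NULL;
```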

UUID primary keys on everything

Not SERIAL, not BIGINT. UUIDs. Every table.

The immediate benefit: entity_links can reference any table with just (source_type, source_id). No composite foreign keys, no lookup tables. The knowledge graph doesn't care if the source is a note or an email — it's always a UUID.
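
A minimal sketch of that polymorphic shape (the real table gained more columns later, e.g. `link_type` and `confidence`; names beyond `source_type`/`source_id` are illustrative):

```sql
-- One row links any record to any other record, regardless of table.
CREATE TABLE IF NOT EXISTS entity_links (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_type  VARCHAR(30) NOT NULL,  -- 'note', 'email', 'event', ...
    source_id    UUID NOT NULL,
    target_type  VARCHAR(30) NOT NULL,
    target_id    UUID NOT NULL,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(source_type, source_id, target_type, target_id)
);
```

Because every primary key is a UUID, `source_id` and `target_id` can point at any table without composite foreign keys.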

The less obvious benefit: when we added multi-user support later, there were zero ID collision issues. Users get created on different instances, sync happens, and UUIDs just work.

The migration timeline: from 0 to 137

Here's roughly how the schema evolved, grouped by what was happening:

Migrations 0000–0010: The foundation

The initial thirteen tables. Then tsvector search columns for free-tier full-text search. Then the sync engine infrastructure — change_log, sync_connectors, sync_history. Custom fields support. Email accounts. File snapshots and shared links. Photo albums and face clusters.
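
The tsvector search columns were roughly this shape: a generated column plus a GIN index (sketched here for notes; the actual column names and text-search configuration are assumptions):

```sql
-- Keep a full-text search vector in sync with title + content automatically.
ALTER TABLE notes
  ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('simple', coalesce(title, '') || ' ' || content)
  ) STORED;

CREATE INDEX notes_search_idx ON notes USING gin (search_vector);
```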

At this point we had maybe 20 tables and the system could do CRUD, search, and sync. A functional personal cloud. I could have stopped here.

I did not stop here.

Migrations 0050–0065: The intelligence layer

This is where things got wild. We added:

  • change_log_affected_columns — tracking which specific fields changed, not just "this record was updated"
  • record_heat — heat scores that decay over time, turning every record into a memory that fades unless you interact with it
  • graph_entities — knowledge graph nodes (Person, Project, Location, Topic) extracted automatically by an LLM
  • Extended entity_links with link_type, confidence, strength, and created_by columns
  • semantic_cache — caching API responses by query embedding similarity
  • Reserved schemas for future domains (bookmarks, financial accounts, health data)

The record_heat table deserves its own blog post (coming soon), but here's the gist: every record has a heat score between 0 and 1. It decays exponentially over time. Every time you access a record, the heat bumps up. The search algorithm uses heat as a post-fusion multiplier — hot records rank higher than cold ones, all else being equal.
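
The decay can be computed at read time rather than by rewriting rows. A hypothetical query along those lines, assuming a `heat` column and a `last_accessed_at` timestamp, with a made-up 7-day half-life (the project's actual columns and decay constant may differ):

```sql
-- Effective heat = stored heat * exp(-ln(2) * age / half_life).
SELECT record_id,
       heat * exp(
           - ln(2)
           * extract(epoch FROM now() - last_accessed_at)
           / (7 * 24 * 3600)        -- 7-day half-life, in seconds
       ) AS effective_heat
FROM record_heat
ORDER BY effective_heat DESC;
```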

The knowledge graph was another inflection point. Suddenly entity_links wasn't just "this note mentions this contact" — it was "the system automatically discovered that Person:Ana García appeared in 3 emails, 2 notes, and tomorrow's calendar event, with confidence 0.85."
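
That "appeared in 3 emails, 2 notes" summary is the kind of aggregate the extended `entity_links` columns make cheap. A hypothetical read, assuming graph nodes are linked via a `target_type`/`target_id` pair mirroring the source side (everything here beyond the columns named above is illustrative):

```sql
-- Where does one graph entity (e.g. a Person node) show up, per domain?
SELECT source_type,
       count(*)                         AS appearances,
       round(avg(confidence)::numeric, 2) AS avg_confidence
FROM entity_links
WHERE target_type = 'graph_entity'
  AND target_id   = $1                  -- the Person node's UUID
GROUP BY source_type;
```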

Migrations 0080–0103: Proactive intelligence

Sleep-time jobs. Derived insights. User preferences. Proactive feedback. Agent tokens. Audit logs. Data sensitivity labels. PII routing logs. Pseudonym maps.

This batch transformed the system from "a database you query" to "a database that thinks while you sleep." The sleep-time engine runs background LLM jobs during idle periods — discovering cross-domain correlations, extracting behavioral preferences, generating insights. All stored in tables that started as sketches on a whiteboard.

Migrations 0110–0137: The OS layer

App system tables. Email moderation. Clipboard items. Visual intelligence embeddings. Face detection infrastructure. Kanban boards with labels, comments, dependencies, and checklists. RSS feeds with categories and AI summarization.

By migration 0137, the schema had grown from "a personal notes database" into something that manages your entire digital life — notes, emails, calendar, contacts, files, photos, diary, projects, feeds, and an AI that understands the connections between all of it.

What I'd do differently

I'd plan the multi-user migration from day one. Migration 0009 (add_user_id_to_domain_tables) added user_id to every single domain table. Every. Single. One. That's a lot of ALTER TABLE statements, a lot of index recreations, and a lot of unique constraint changes (diary entries went from UNIQUE(entry_date) to UNIQUE(user_id, entry_date)). If I'd included user_id in the initial schema, that migration wouldn't exist.
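
Multiplied across every domain table, migration 0009 was dozens of statements of roughly this shape (sketched for diary entries; exact table, constraint, and column names are assumptions):

```sql
-- Add the column nullable first so existing rows don't violate NOT NULL.
ALTER TABLE diary_entries
  ADD COLUMN user_id UUID REFERENCES users(id);

-- ...backfill user_id for the existing single-user rows, then tighten:
ALTER TABLE diary_entries
  ALTER COLUMN user_id SET NOT NULL;

-- Uniqueness becomes per-user instead of global.
ALTER TABLE diary_entries
  DROP CONSTRAINT diary_entries_entry_date_key,
  ADD CONSTRAINT diary_entries_user_entry_date_key
      UNIQUE (user_id, entry_date);
```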

I'd version-track the embedding model from the start. We added model_version to the embeddings table later. But by then we'd already generated thousands of embeddings with one model. When we switched from nomic-embed-text to qwen3-embedding:0.6b, we had to re-embed everything. If the model version had been there from day one, the re-embedding could have been incremental.
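
With that column in place from day one, a model switch becomes a filtered backfill instead of a full wipe. A sketch (assuming `model_version` holds the model name; `IS DISTINCT FROM` also catches the pre-column NULL rows):

```sql
ALTER TABLE embeddings
  ADD COLUMN IF NOT EXISTS model_version VARCHAR(100);

-- Re-embed only the rows still carrying the old model's vectors.
SELECT id, domain, record_id, content_text
FROM embeddings
WHERE model_version IS DISTINCT FROM 'qwen3-embedding:0.6b';
```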

I'd actually apply my migrations. The single most common bug in development wasn't a code error — it was me forgetting to run a migration I'd already written. The schema in the SQL file said one thing; the database said another. We ended up with a simple `for f in core/drizzle/0*.sql; do psql $DATABASE_URL -f "$f"; done` command, but I should have automated it from week one.
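
A slightly safer version of that loop keeps a ledger of applied files so re-runs skip work. This is a hypothetical sketch, not the project's actual runner (the ledger-file approach and `ON_ERROR_STOP` flag are my additions):

```shell
# Apply every pending migration in $1, recording each file in ledger $2
# so a second invocation is a no-op for files that already ran.
apply_migrations() {
    local dir="$1" ledger="$2" f name
    touch "$ledger"
    for f in "$dir"/0*.sql; do
        name="$(basename "$f")"
        # skip files already recorded in the ledger
        if grep -qxF "$name" "$ledger"; then
            continue
        fi
        # ON_ERROR_STOP makes psql exit non-zero on the first failed statement
        psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -f "$f" || return 1
        # record the file only after it applied successfully
        echo "$name" >> "$ledger"
    done
}
```

A real setup would likely use a `schema_migrations` table inside the database instead of a local file, so every client sees the same state.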

The numbers

As of today:

| Metric | Count |
| --- | --- |
| SQL migration files | 137 |
| Tables in the schema | ~50 |
| Drizzle ORM schemas | 35+ |
| Domains (notes, events, etc.) | 7 core + 5 extended |
| Index definitions | 80+ |

Some migrations took minutes to write. Others took days of back-and-forth to get right.

The takeaway

If I could go back and tell February-18th me one thing, it would be: the schema is the product. Not the API. Not the frontend. Not the AI. The schema.

Every feature we built — heat scoring, knowledge graph, sleep-time intelligence, PII routing, multi-user isolation — started as a migration file. The tables defined the boundaries of what was possible. Getting the schema right (or at least right enough to iterate on) was the single highest-leverage activity in the entire project.

137 migrations sounds like a lot. And it is. But each one was a small, deliberate step from "personal notes database" to something much bigger. And we're not done — migration 0138 is already drafted.


This is the first in a series of technical posts about building a self-hosted AI productivity OS. Next up: how a 4-slot hook pipeline makes every CRUD operation feed embeddings, heat tracking, entity extraction, and the changelog — without any of them blocking each other.
