DEV Community

ulku

Detecting Schema Drift Before It Breaks Production

At some point it happens to everyone. You run alembic revision --autogenerate, look at the output — and there it is: a DROP TABLE for a table that doesn't exist in production. Or the opposite: the migration history looks clean, but the production table greets you with three extra columns nobody can account for.

Because the application keeps running, the gap goes unnoticed for weeks. Then one day someone asks: "What's the actual difference between our models and the database?" Everyone looks at each other.

The tool I wrote to answer that question is SmartMigrate — a library that compares a live database against SQLAlchemy models, flags every difference by risk level, and generates runnable SQL.


The Core Idea: Two Snapshots, One Diff

At the center of the library is SchemaSnapshot — a full picture of a database schema at a given moment: tables, columns, types, indexes, foreign keys, nullable/unique/default metadata. This Pydantic model can be serialized to JSON and version-controlled in a CI/CD pipeline.

A snapshot can be taken from two sources:

From models (offline): SQLAlchemy declarative model classes are imported and Model.__table__ is introspected. No live database connection needed.

From a live database: sqlalchemy.inspect(engine) reflects the current schema.

from smartmigrate.snapshot import snapshot_from_models, snapshot_from_url

snap_model = snapshot_from_models("app.models", version="model")
snap_db = snapshot_from_url("mysql://root:pass@localhost:3306/mydb", version="db")

Both snapshots serialize to JSON. In environments without database access, a previously saved file can be used — I'll get to that shortly.
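To make the snapshot idea concrete, here is a stdlib-only sketch of what a JSON-serializable schema snapshot can look like. The field names are simplified and hypothetical; the actual SchemaSnapshot is a Pydantic model carrying richer metadata (indexes, foreign keys, defaults).

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical, simplified snapshot shape (illustration only; the real
# library uses a Pydantic model with more metadata per column and table).
@dataclass
class Column:
    name: str
    type: str
    nullable: bool = True

@dataclass
class Table:
    name: str
    columns: list

@dataclass
class Snapshot:
    version: str
    tables: dict

snap = Snapshot(
    version="prod-2026-04-29",
    tables={"user": Table("user", [Column("id", "INT", nullable=False)])},
)

# Serializable, so a snapshot can be committed to git and diffed in CI
print(json.dumps(asdict(snap), indent=2))
```

Because the snapshot is plain data, comparing two of them is a pure function with no database in the loop.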


Differ: What Changed?

Once two snapshots are in hand, diff_snapshots runs. The pipeline is straightforward:

  1. Tables in the old snapshot but not the new → potential DROP
  2. Tables in the new snapshot but not the old → potential ADD
  3. For tables present in both, column and index differences are compared

The raw output is a list of DiffOp objects. Supported operation types:

ADD_TABLE, DROP_TABLE, RENAME_TABLE
ADD_COLUMN, DROP_COLUMN, RENAME_COLUMN, MODIFY_COLUMN
ADD_INDEX, DROP_INDEX
ADD_FK, DROP_FK
from smartmigrate import differ
from smartmigrate.safety import annotate

raw = differ.diff_snapshots(snap_db, snap_model)
annotated = annotate(raw)

for op in annotated.ops:
    print(f"[{op.risk.value:7}] {op.kind.value:20} {op.table}.{op.column or ''}")
    print(f"         Why: {op.reason}")

A raw diff is just a list. The real question is what each difference actually means.


Rename Intelligence: RENAME, Not DROP+CREATE

Now we get to an interesting problem.

Alembic's autogenerate has a known limitation: when a column or table is renamed, it can't detect it. It reports the old name as DROP and the new name as CREATE. On large tables, that's an unnecessary data-loss risk and a false alarm.

"Can we do better?" was the starting point. SmartMigrate's rename detection module solves this with structural similarity scoring.

Column Renames

When a column "disappears" and another "appears", a composite score is calculated between the two:

score = 0.55 × name_sim + 0.30 × type_sim + 0.15 × pos_sim
  • name_sim: character-level similarity between the two names via SequenceMatcher (0–1)
  • type_sim: type compatibility — 1.0 for same type, 0.5 for compatible types (INT and BIGINT), 0.0 otherwise
  • pos_sim: similarity of the column's ordinal position within the table

Why isn't name similarity alone enough? Because semantic shifts like price → cost exist. Without the type and position weights, such pairs are impossible to catch. Greedy best-first assignment ensures each column participates in at most one match.
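The scoring idea can be sketched in a few lines. This is an illustration of the formula above, not the library's actual code, and the type-compatibility groups here are hypothetical examples:

```python
from difflib import SequenceMatcher

# Hypothetical compatibility groups for the type_sim term
COMPATIBLE = [{"INT", "BIGINT"}, {"VARCHAR", "TEXT"}]

def type_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if any({a, b} <= group for group in COMPATIBLE):
        return 0.5
    return 0.0

def rename_score(old, new, old_type, new_type, old_pos, new_pos, n_cols):
    name_sim = SequenceMatcher(None, old, new).ratio()
    pos_sim = 1.0 - abs(old_pos - new_pos) / max(n_cols - 1, 1)
    return 0.55 * name_sim + 0.30 * type_sim(old_type, new_type) + 0.15 * pos_sim

# Same type and same position rescue a partial name match,
# pushing the pair above the default 0.72 threshold:
print(round(rename_score("user_no", "user_num", "INT", "INT", 4, 4, 10), 2))  # → 0.89
```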

Default threshold: 0.72. Adjustable via environment variable:

export SMARTMIG_COL_RENAME_THRESHOLD=0.80

Table Renames

For tables, Jaccard similarity over the column set is combined with name similarity:

score = 0.4 × name_similarity + 0.6 × column_jaccard

If the threshold is met, the system emits RENAME_TABLE or RENAME_COLUMN instead of a DROP+CREATE pair.
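The table score can be sketched the same way (again an illustration of the formula, not the library's exact code):

```python
from difflib import SequenceMatcher

def table_rename_score(old_name: str, new_name: str,
                       old_cols: set, new_cols: set) -> float:
    # Jaccard similarity over the column sets, weighted 0.6
    union = old_cols | new_cols
    jaccard = len(old_cols & new_cols) / len(union) if union else 0.0
    # Name similarity, weighted 0.4
    name_sim = SequenceMatcher(None, old_name, new_name).ratio()
    return 0.4 * name_sim + 0.6 * jaccard

# An identical column set dominates even a mediocre name match:
cols = {"id", "email", "created_at"}
print(round(table_rename_score("users", "app_users", cols, cols), 2))  # → 0.89
```

Weighting the column set above the name reflects how renames happen in practice: the name changes, the shape of the table usually doesn't.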


Safety Engine: Every Change Has a Cost

When looking at a diff list, the real question in your head is: "What happens if I run this step?"

Adding a column and dropping a column don't carry the same weight. SmartMigrate assigns each DiffOp to one of three risk levels:

| Operation | Risk | Why |
| --- | --- | --- |
| Add nullable column | SAFE | Existing rows get NULL |
| Add non-unique index | SAFE | Only query performance changes |
| Add NOT NULL column (with default) | SAFE | Existing rows get a valid value |
| Rename table/column | CAUTION | Application code may break |
| Add unique index | CAUTION | Fails if duplicates exist |
| Add foreign key | CAUTION | Fails if orphan rows exist |
| Drop table/column | DANGER | Permanent data loss |
| Add NOT NULL column (no default) | DANGER | Existing rows violate the constraint |
| Type change to incompatible family | DANGER | Risk of data corruption |
| nullable → NOT NULL | DANGER | Existing NULL rows violate the constraint |

Each DiffOp gets not just a risk level, but also a reason explaining why the risk exists and a suggestion for how to mitigate it:

DANGER: Adding NOT NULL column 'fee' without a default to 'transaction'.
        Existing rows cannot satisfy the constraint.
Suggestion: Add a server_default so existing rows get a valid value.
            Alternatively, add as nullable first, backfill, then add NOT NULL.

The risk matrix in safety.py goes beyond a mechanical diff — it contains contextual reasoning. Whether ADD_COLUMN NOT NULL is SAFE or DANGER depends entirely on whether server_default is present. Two lines of code, completely different risk profiles.
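The ADD_COLUMN branch of that reasoning can be sketched as follows (illustrative, not the library's exact code):

```python
from enum import Enum
from typing import Optional

class Risk(Enum):
    SAFE = "safe"
    CAUTION = "caution"
    DANGER = "danger"

# The same operation flips between SAFE and DANGER depending on
# nullability and whether a server default is present.
def classify_add_column(nullable: bool, server_default: Optional[str]) -> Risk:
    if nullable:
        return Risk.SAFE    # existing rows simply get NULL
    if server_default is not None:
        return Risk.SAFE    # existing rows get the default value
    return Risk.DANGER      # existing rows cannot satisfy NOT NULL

print(classify_add_column(nullable=False, server_default=None).value)  # → danger
```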


Migration Graph: Not a List, a DAG

Most migration systems treat changes as an ordered list. Is that enough? Usually yes — but not always.

SmartMigrate thinks differently: it builds a Directed Acyclic Graph (DAG) where each DiffOp is a node and dependencies are edges. Edges come from three rules:

  1. Intra-table ordering: RENAME_TABLE → ADD_COLUMN → MODIFY_COLUMN → ADD_INDEX → DROP_INDEX → DROP_COLUMN → DROP_TABLE

  2. Foreign key dependencies: An ADD_COLUMN that references another table must run after that table's ADD_TABLE step.

  3. Independent tables: Steps belonging to unrelated tables can run in any order.

The DAG is validated for cycles via networkx. A topological sort produces a guaranteed execution order. The graph can also be visualized:

# Print the migration DAG as ASCII
smartmigrate graph snapshots/prod.json snapshots/dev.json

# Export as Graphviz DOT
smartmigrate graph snapshots/prod.json snapshots/dev.json --format dot --out migration.dot
dot -Tpng migration.dot -o migration.png
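The dependency rules above boil down to a topological sort. A stdlib sketch of the same idea (the library itself uses networkx; the step names here are hypothetical):

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps that must run before it.
deps = {
    "ADD_COLUMN transaction.currency_id": {"ADD_TABLE currency"},  # FK rule
    "ADD_INDEX ix_transaction_currency": {"ADD_COLUMN transaction.currency_id"},  # intra-table order
}

# static_order() raises CycleError if the graph has a cycle;
# otherwise it yields a guaranteed-valid execution order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Independent tables simply end up with no edges between their steps, so any interleaving the sort produces is valid.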

Practical Use: Autogenerate in One Command

In day-to-day use, most of the time there's one question: "What's the difference between my models and my database, and what SQL closes it?"

smartmigrate autogenerate \
  --package app.models \
  --db-url "mysql://root:pass@localhost:3306/mydb" \
  --message "add_fee_columns" \
  --out migrations/smartmigrate

The command runs in four steps:

  1. Snapshot the database (via sqlalchemy.inspect)
  2. Snapshot the model package (offline, using stubs)
  3. Compute the DB → Model diff and run safety analysis
  4. Build the MigrationPlan and write four files:

migrations/smartmigrate/
  20260429_130500_add_fee_columns.up.sql
  20260429_130500_add_fee_columns.down.sql
  20260429_130500_add_fee_columns.changelog.md
  20260429_130500_add_fee_columns.plan.json

Inside .up.sql, each step is annotated with its risk level and description:

-- Step: 3a7f9c12ab4e
--   Adding nullable column 'fee' (DECIMAL(18,2)) to 'transaction'.
--   Risk: SAFE
-- --------------------------------------------------

ALTER TABLE `transaction` ADD COLUMN `fee` DECIMAL(18, 2) NULL;

A plan containing DANGER steps is blocked by default. Use dry-run to inspect the plan before deciding:

# Show the plan without writing any files
smartmigrate autogenerate --package app.models --db-url "..." --dry-run

# Proceed through DANGER steps explicitly
smartmigrate autogenerate --package app.models --db-url "..." --allow-danger

Offline Mode and CI/CD

Connecting directly to a production database from CI/CD is usually not possible. Or if it is, you probably don't want to.

SmartMigrate offers two approaches.

Snapshot file: Taken once when the connection is available, reused in CI:

# Capture a snapshot from the live DB
smartmigrate snapshot \
  --db-url "mysql://..." --version prod-2026-04-29 --out snapshots/prod.json

# In CI: autogenerate from the saved snapshot
smartmigrate autogenerate \
  --package app.models \
  --from-db-snapshot snapshots/prod.json \
  --message "add_fee_columns"

Model snapshots are fully offline: When snapshot_from_models runs, application startup code is never triggered. Native drivers like MySQLdb and psycopg2 are stubbed with MagicMock; app.db.database.Base is replaced with a real declarative_base(). For large applications that spin up Redis, Celery, or OpenTelemetry on boot, this matters — you can analyze the model schema without starting the application.
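The stubbing trick itself is simple. A sketch of the idea (illustrative, not the library's exact code): register fakes in sys.modules before the model package is imported, so the import machinery never looks for the real drivers.

```python
import sys
from unittest.mock import MagicMock

# Register stubs before any model import pulls in a native driver.
# setdefault keeps the real module if one happens to be installed.
for driver in ("MySQLdb", "psycopg2"):
    sys.modules.setdefault(driver, MagicMock())

# Any downstream `import MySQLdb` now resolves without the real driver,
# so importing a model package can succeed on a machine with no DB stack.
import MySQLdb  # resolves to the stub (or the real driver, if installed)
print("MySQLdb" in sys.modules)  # → True
```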


Python API

The API behind the CLI can be used directly. Every step above is callable individually:

from smartmigrate.snapshot import snapshot_from_models, snapshot_from_url
from smartmigrate.planner import plan_from_snapshots, render_sql_file
from smartmigrate.explainer import explain_plan, generate_changelog
from smartmigrate.safety import safety_report

snap_db    = snapshot_from_url("mysql://root:pass@localhost/mydb", version="db")
snap_model = snapshot_from_models("app.models", version="model")

plan = plan_from_snapshots(snap_db, snap_model, dialect="mysql")

report = safety_report(plan)
print(f"Danger ops: {report['risk_counts']['danger']}")

up_sql = render_sql_file(plan, direction="up")
changelog = generate_changelog(plan)
print(explain_plan(plan))

Revision History

Every time autogenerate runs, a new entry is appended to smartmigrate_versions.json — tracking each revision's ID, message, risk summary, and applied status.

smartmigrate history --out migrations/smartmigrate
┌─────────────────────────────────┬─────────────────┬──────────┬─────────┐
│ Revision ID                     │ Message         │ DB→Model │ Status  │
├─────────────────────────────────┼─────────────────┼──────────┼─────────┤
│ 20260429_130500_add_fee_columns │ add_fee_columns │ db→model │ pending │
│ 20260402_091200_rename_user_id  │ rename_user_id  │ db→model │ applied │
└─────────────────────────────────┴─────────────────┴──────────┴─────────┘

Architecture

The library is ten modules, each with a single responsibility:

snapshot.py         SchemaSnapshot from models or a live DB
differ.py           Two-snapshot comparison: raw DiffOp list
rename_detector.py  Rename detection via structural similarity
safety.py           Risk classification per DiffOp
graph.py            DAG construction and topological sort
sql_generator.py    SQL DDL generation from DiffOps
planner.py          Pipeline orchestrator
explainer.py        Human-readable output and changelog
versioning.py       Revision IDs and file management
cli.py              Typer-based CLI

differ.py doesn't know about risk — it just finds differences. safety.py doesn't know about SQL — it just assigns risk. sql_generator.py doesn't know about dependencies — it just generates SQL. This separation makes each layer independently testable:

def test_not_null_no_default_is_danger():
    old = snap("a", {"t": table("t", [col("id")])})
    new = snap("b", {"t": table("t", [col("id"), col("required", nullable=False)])})
    diff = quick_diff(old, new)
    add_ops = [op for op in diff.ops if op.kind == DiffKind.ADD_COLUMN]
    assert add_ops[0].risk == RiskLevel.DANGER

Why Alongside Alembic?

SmartMigrate isn't here to replace Alembic — it's here to catch what Alembic doesn't see.

  • Detecting renames as RENAME instead of misreading them as DROP+CREATE
  • Explaining why each change carries risk
  • Tracking dependencies between migration steps
  • Keeping model snapshots under version control in CI/CD

The two work well together: use SmartMigrate to understand what changed, Alembic to apply it.


Installation

pip install smartmigrate

To install from source and run the demo — no database connection required:

git clone https://github.com/bayraktarulku/smartmigrate
cd smartmigrate
pip install -e .
python demo.py

The demo compares two snapshots: v1.5.0 → v1.6.0. Column rename detection, NOT NULL column risk, a new table and a dropped table — all visible in the terminal.


Conclusion

The gap between models and the database isn't a bug. It's inevitable. The question is how and how often you measure it.

SmartMigrate answers "why is this change risky?" for every operation, avoids misreading renames as data loss, and treats migrations as a connected graph rather than an independent list. If you spot something missing or wrong, feel free to open an issue.

Source code: github.com/bayraktarulku/smartmigrate

