Legacy-DB Reverse Engineering & Migration

#python #pyodbc #msaccess #sqlite

1.47 million parts liberated from a 1.2 GB password-protected manufacturer database and migrated into the client's new system, incl. 82,076 converted exploded-view drawings, validated to zero rule violations, fully auditable.

The challenge

The client runs the after-market spare-parts business for an international Tier-1 construction-machinery manufacturer and holds a valid license for that manufacturer's maintenance database, a 1.2 GB password-protected MS Access file (.mdb) from the early 2000s. The lawfully acquired data needs to migrate into the client's new ERP / inventory system. Three walls in the way.

First: the database's original configuration parameters, specifically the access credentials, got lost internally over the years, and the legacy manufacturer tool will open the DB in the background but exports no raw data.

Second: the schema. Around 30 interlocked tables with cryptic column names, n:m relations between catalogues and sales models, part names spread across three tables plus a language field, none of it readable without structural analysis.

Third: roughly 82,000 exploded-view drawings in the obscure DjVu format of the early 2000s, which the new system doesn't render.

And all of it at a scale (~1.9 million raw rows) where Excel would silently truncate at 1,048,576 rows, effectively swallowing half the machinery fleet. The job: tear down all three walls and migrate the full, licensed dataset cleanly into the new system.

The approach

Forensic recovery of lost credentials. Rather than treating the legacy maintenance tool as a black box, the client's installation environment is forensically analysed: a targeted scan across the entire local install directory (.exe, .dll, .ini, .cfg, .xml) walks the configuration files and runtime artifacts looking for the classic indicators of persisted connection parameters, Jet OLEDB strings, PWD/UID references, .mdw paths, with a context window around every match. The configuration leftovers reconstruct the original credentials, which the client owns by virtue of his license anyway. From that point on, the DB is readable directly at the data layer via pyodbc + Microsoft Access Driver.
Structural schema reconstruction. The DB has no docs, no ER diagrams, only table names and column codes. A schema scanner pulls 10-row samples plus full column headers from each table and surfaces the relationships between the ~30 tables, which table holds replacement numbers, which holds image file IDs, where the language variants of part names live. That's the foundation for the later single-source join. Result: three laterally scattered sources for replacement-part numbers (one live mapping table plus two master-data tables with historical replacements), two for image files (with fallback chain), and one language-filtered source for English plain-text names.
One-shot cold storage in SQLite. The entire Access DB is cloned in a single take into a local SQLite file, every column as TEXT (the safest defence against the inconsistent typing of the source), in 10k chunks via fetchmany. Benefit: from then on every analysis runs locally, any number of times, without ODBC overhead, the 1.2 GB millstone becomes a 400 MB read-only source decoupled from the migration pipeline.
Multi-table join as single source of truth. A central query bundles all the schema knowledge into one statement. The parts catalogue joins with the catalogue-to-model mapping table and the sales-model table (because maintenance catalogues are named differently internally than the end-product models), the figures table (group / subgroup / image file via COALESCE as a fallback chain), the language-filtered plain-text-name table, and a CTE that condenses all replacement-part relationships per part into a comma-separated list with GROUP_CONCAT. Defensive LEFT JOINs with IFNULL catch empty key fields, the target system needs a complete tuple per row, not a sparse one.
Self-healing part names. The raw data often had only the placeholder "PART", pure alphanumeric codes, or junk like "(OPTIONAL)" in the "part name" field, values any modern ERP / shop system would flag as garbage immediately. Solution: an in-RAM dictionary built from the English-language master-data table maps part numbers to correct plain-text names. Heuristics (min length 3, not pure code string, not "PART" placeholder) decide when to look up, on a hit, the name gets healed, otherwise the record is marked UNKNOWN_NAME_REQUIRES_CHECK and hard-filtered in a later stage. The healing step alone rescues tens of thousands of rows from the trash filter.
DjVu → JPG pipeline with multithreading. ~82,000 exploded-view drawings sat in the obscure DjVu format, a format the new system can't render and for which there's no modern standard library. Pipeline: ddjvu (DjVuLibre) converts each file into a temporary PDF, PyMuPDF (fitz) renders the first page at 150 DPI in greyscale as JPG (massive disk-space win without legibility loss), the temp PDF is deleted immediately. ThreadPoolExecutor runs with os.cpu_count() workers. Idempotent skip logic (existing targets get skipped) makes the run re-runnable, a crash at file 50,000 doesn't cost 50,000 reconversions.
System-aware hyperlinks. Instead of bare paths, every image cell is written as =HYPERLINK("…\<path>.jpg", "<name>.jpg"), a format the target system understands directly as a clickable image link. A recursive index of all JPGs (filename → relative path) resolves the images before write; not-found images get defensively flagged as MISSING_JPG: <stem> rather than silently polluting the column or aborting the migration.
Iterative cleanup stages with audit trail. Raw → v2 (hyperlinks injected) → Pristine_v2 (UNKNOWN names + manually blocked regional special variants removed) → Pristine_v3 (string cleanup: semicolons, leading special chars, double whitespace) → Final_Delivered (replacement-part numbers re-loaded from both master-data sources, deduplicated) → Perfect (three last problem rows with polluted SubGroup removed). Every stage is its own file and its own script, debuggable, reproducible, with a clear audit trail. When someone later asks "why isn't row X in the new system?", there's an answer.
Strict-mode validation against the target schema. A dedicated validator checks the final CSV against a 5-rule schema contract that mirrors 1:1 what the target system accepts: header (exactly 10 columns in fixed order), required fields (catalogue + part number set), no illegal control characters (regex \x00-\x1f), part-number format (no parentheses), part-name format (uppercase, no leading special character, no UNKNOWN_NAME), Image_File format (=HYPERLINK + .jpg or explicit MISSING_JPG). Result after several iterations: 1,473,210 rows, zero rule violations. Every violation lands in a log file with row number + catalogue + part number + reason, surgical, not "something is broken".
Trust by transparency. Two separate reports ship alongside the data. Coverage report: all 386 master catalogues from the source DB with mapping to sales model and row count in the final CSV, the client sees, per catalogue, whether and how many parts were migrated into the new system. Trash analysis: all 470,276 filtered-out rows with reason, missing English name, manually blocked regional special variant, part number under four characters, other cleanup filter. The client gets not just the migration but every single filter decision documented.

The diagram shows the full migration flow: from forensic credential recovery through the opened 1.2 GB MS Access source database, the SQLite clone, the central multi-table join query, and the parallel DjVu-to-JPG conversion pipeline, all the way to hyperlink injection and the strict-mode validation against the new target system's schema.

The result

Vendor lock-in lifted. A 1.2 GB legacy database the client had legally licensed but could no longer practically use was forensically opened, its schema structurally reconstructed, and the full dataset migrated into his ownership, without further dependence on the legacy maintenance tool.
1,473,210 migrated parts records across 339 sales models (from 386 master catalogues including n:1 mappings, where one catalogue covers several model variants), ready for ingestion into the new system.
~82,000 original exploded-view drawings converted from old DjVu format into modern JPG and referenced via =HYPERLINK, defensively flagged where conversion failed, rather than blocking delivery.
Zero rule violations in the final CSV against a 5-rule strict-mode validation contract, the target system's acceptance criterion was binary, the result is binary.
1.2 GB legacy MDB → 400 MB SQLite (cold storage) → 274 MB CSV (migration delivery). Full data sovereignty beyond the proprietary manufacturer tool, in a format every modern ERP, shop, or BI system understands.
Full audit trail over every single one of the 470,276 dropped raw rows plus over each of the 386 catalogues, not a black-box ETL but a migration contract the client can trace row by row.
Excel hard limit dodged. The CSV stays streamable, the new system ingests sequentially, nothing gets silently cut off at 1,048,576 rows the way a naive Excel re-save would have done.

DEV Community

Legacy-DB Reverse Engineering & Migration

The challenge

The approach

The result

Top comments (0)