Modern Data Stack Migration — Day 1: Scaling to 8+ Companies with DRY Architecture and Chasing a $2M Discrepancy

#dataengineering #python #dbt #architecture

Hello everyone! Following up on my previous post, Day 1 of my Modern Data Stack migration was an absolute rollercoaster of refactoring and deep data auditing.

I’m moving our legacy system (spreadsheets and Qlik) into a robust pipeline using Python, ClickHouse, and dbt. Here is what went down over the last 24 hours.

1. From Messy Scripts to a Single, Parameterized Extraction Engine 🛠️

In the legacy setup, each company had its own folder, its own .env file, and its own duplicated Python extraction script. It was a maintenance nightmare.

Yesterday, I completely refactored this structure:

Centralized Configuration: Merged all separate environments into a single, global .env file at the root level, mapping all 8+ companies and their branches.
Eliminated Code Duplication (DRY): Instead of having identical extraction logic copied across folders, I built a single, unified codebase. Now, we have one universal script for Sales, one for Stock, one for Orders, etc. The behavior changes dynamically based on the company argument we pass to the CLI (e.g., python -m extract.run extract --source company1).

To speed up this refactoring, I used Claude to generate the initial application skeleton. Since the AI already had the context of our legacy extraction logic, translating it into this new clean architecture was incredibly smooth.

2. Highs and Lows: The Data Parity Challenge

With the pipeline modernized, I ran the pilot ingestion for Company #1. To minimize friction for our downstream BI consumers, I kept the ClickHouse Bronze tables structured 1:1 with the legacy CSV schemas.

The Good News: The data ingestion into the Bronze layer worked flawlessly. Moving up to the Silver layer (where we do data cleaning and domain-specific transformations), everything validated beautifully. Row counts matched perfectly.
The "Fun" Part (The $2 Million Gap): When I materialized the Gold layer (our consolidated group business models), I hit a massive wall. The new pipeline reported $2 million USD more in revenue than the legacy system.

Why is there an inconsistency?

Engineering notes show an overcount in sales invoices. In Data Engineering, a difference this large usually means one thing: undocumented legacy business rules.

Right now, I'm auditing our dbt macros and transformation models. There is a high chance that the legacy system applies specific multi-company exclusions, cancellation filters, or tax logic that wasn't officially documented in the initial migration scope.

Next Steps

Audit the Gold layer rules: Write strict dbt tests to isolate exactly which invoice types are causing the inflation.
Fix the business logic: Align the multi-company macro constraints until we hit 100% data parity for Company #1.
Scale: Once the rule engine is bulletproof, start onboarding the remaining 7+ companies using our new centralized pipeline.

Data engineering is rarely about writing code that works perfectly on the first run; it’s about refactoring for scale and hunting down hidden business logic.

Has anyone else faced a massive data discrepancy during a migration?