Modern Data Stack Migration — Day 1: Scaling to 8+ Companies with DRY Architecture and Chasing a $2M Discrepancy

Matheus Dallacort — Wed, 10 Jun 2026 12:41:17 +0000

Hello everyone! Following up on my previous post, Day 1 of my Modern Data Stack migration was an absolute rollercoaster of refactoring and deep data auditing.

I’m moving our legacy system (spreadsheets and Qlik) into a robust pipeline using Python, ClickHouse, and dbt. Here is what went down over the last 24 hours.

1. From Messy Scripts to a Single, Parameterized Extraction Engine 🛠️

In the legacy setup, each company had its own folder, its own .env file, and its own duplicated Python extraction script. It was a maintenance nightmare.

Yesterday, I completely refactored this structure:

Centralized Configuration: Merged all separate environments into a single, global .env file at the root level, mapping all 8+ companies and their branches.
Eliminated Code Duplication (DRY): Instead of keeping identical extraction logic copied across folders, I built a single, unified codebase with one script per domain: one for Sales, one for Stock, one for Orders, and so on. The behavior changes based on the company we pass to the CLI through the --source argument (e.g., python -m extract.run extract --source company1).
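A minimal sketch of what a parameterized entry point like this could look like. Only the `extract` subcommand and the `--source` flag come from the command above. The `--entity` flag, the `SourceConfig` fields, and the `COMPANY1_HOST`-style variable names are assumptions made for illustration, and the sketch assumes the root .env file has already been loaded into the process environment.

```python
import argparse
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Connection settings for one company (field names are hypothetical)."""
    name: str
    host: str
    database: str


def load_source_config(source, env=None):
    """Read one company's settings from prefixed variables, e.g. COMPANY1_HOST."""
    env = os.environ if env is None else env
    prefix = source.upper()
    try:
        return SourceConfig(
            name=source,
            host=env[f"{prefix}_HOST"],
            database=env[f"{prefix}_DATABASE"],
        )
    except KeyError as missing:
        raise SystemExit(f"Missing setting {missing} for source '{source}'")


def build_parser():
    parser = argparse.ArgumentParser(prog="extract.run")
    subcommands = parser.add_subparsers(dest="command", required=True)
    extract = subcommands.add_parser("extract", help="Run an extraction for one company")
    extract.add_argument("--source", required=True, help="Company key, e.g. company1")
    # Hypothetical flag: selects which domain script to run.
    extract.add_argument("--entity", default="sales", choices=["sales", "stock", "orders"])
    return parser


def main(argv=None, env=None):
    args = build_parser().parse_args(argv)
    config = load_source_config(args.source, env)
    # A real extractor would be dispatched here; this sketch only reports the plan.
    message = f"extract {args.entity} from {config.database}@{config.host} ({config.name})"
    print(message)
    return message
```

Because the company is just an argument, onboarding a new one means adding its variables to the shared .env file rather than copying a folder. A missing variable fails fast with a readable message instead of a stack trace.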

To speed up this refactoring, I used Claude to generate the initial application skeleton. Since the AI already had the context of our legacy extraction logic, translating it into this new clean architecture was incredibly smooth.

2. Highs and Lows: The Data Parity Challenge

With the pipeline modernized, I ran the pilot ingestion for Company #1. To minimize friction for our downstream BI consumers, I kept the ClickHouse Bronze tables structured 1:1 with the legacy CSV schemas.
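One way to keep a Bronze table 1:1 with a legacy CSV is to derive the DDL from the CSV header itself. The sketch below is an illustration, not the project's actual DDL: the table name `bronze.sales`, the all-`String` column types, and the `ORDER BY tuple()` choice are assumptions, on the idea that typing and sorting keys can be deferred to the Silver layer.

```python
import csv
import io


def bronze_ddl_from_csv_header(csv_text, table):
    """Build a CREATE TABLE statement with one String column per CSV header field."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = ",\n".join(f"    `{name.strip()}` String" for name in header)
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n(\n{columns}\n)\n"
        "ENGINE = MergeTree\nORDER BY tuple()"
    )


# Made-up sample standing in for a legacy export.
sample = "invoice_id,issued_at,amount\n1,2026-06-01,10.50\n"
print(bronze_ddl_from_csv_header(sample, "bronze.sales"))
```

Keeping every column as a string in Bronze means a malformed legacy value cannot break ingestion. The cost is that all casting and validation has to happen downstream.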

The Good News: The data ingestion into the Bronze layer worked flawlessly. Moving up to the Silver layer (where we do data cleaning and domain-specific transformations), everything validated beautifully. Row counts matched perfectly.
The "Fun" Part (The $2 Million Gap): When I materialized the Gold layer (our consolidated group business models), I hit a wall. The new pipeline reported USD 2 million more in revenue than the legacy system.

Why is there an inconsistency?

Engineering notes show an overcount in sales invoices. In data engineering, a gap this large during a migration is most often explained by undocumented legacy business rules.

Right now, I'm auditing our dbt macros and transformation models. There is a high chance that the legacy system applies multi-company exclusions, cancellation filters, or tax logic that were never documented in the initial migration scope.
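A rough sketch of how such an audit could be narrowed down: apply each candidate rule to the new pipeline's invoices and see how much of the gap remains. Everything here is made up for illustration, including the rows, the `status` and `intercompany` fields, the rule names, and the legacy total. None of it comes from the real data.

```python
from decimal import Decimal

# Made-up rows standing in for Gold-layer invoices; field names are hypothetical.
invoices = [
    {"id": 1, "status": "issued", "intercompany": False, "amount": Decimal("100.00")},
    {"id": 2, "status": "cancelled", "intercompany": False, "amount": Decimal("40.00")},
    {"id": 3, "status": "issued", "intercompany": True, "amount": Decimal("25.00")},
    {"id": 4, "status": "issued", "intercompany": False, "amount": Decimal("60.00")},
]
legacy_total = Decimal("160.00")  # made-up figure standing in for the legacy report

# Each candidate rule returns True for rows the legacy system would keep.
candidate_rules = {
    "exclude_cancelled": lambda row: row["status"] != "cancelled",
    "exclude_intercompany": lambda row: not row["intercompany"],
}


def remaining_gap(rows, keep, target):
    """New-pipeline total under a filter, minus the legacy target."""
    return sum((r["amount"] for r in rows if keep(r)), Decimal("0")) - target


def rank_rules(rows, rules, target):
    """Rank each rule (and all rules combined) by how close it gets to the target."""
    gaps = {"no_filter": remaining_gap(rows, lambda r: True, target)}
    for name, keep in rules.items():
        gaps[name] = remaining_gap(rows, keep, target)
    gaps["all_rules"] = remaining_gap(
        rows, lambda r: all(k(r) for k in rules.values()), target
    )
    return sorted(gaps.items(), key=lambda item: abs(item[1]))


for name, gap in rank_rules(invoices, candidate_rules, legacy_total):
    print(f"{name}: remaining gap {gap}")
```

In this toy data no single rule closes the gap, but the two combined do. That is the pattern to look for: the rule set that drives the remaining gap to zero is the best candidate for the undocumented legacy behavior, and it still needs confirmation from the business side.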

Next Steps

Audit the Gold layer rules: Write strict dbt tests to isolate exactly which invoice types are causing the inflation.
Fix the business logic: Align the multi-company macro constraints until we hit 100% data parity for Company #1.
Scale: Once the rule engine is bulletproof, start onboarding the remaining 7+ companies using our new centralized pipeline.
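Before onboarding each new company, a simple parity gate can compare revenue totals from both systems and list the companies that still disagree. This is a sketch under assumptions: the function name, the dictionary inputs, and the zero default tolerance are illustrative, and the totals would come from the legacy reports and the Gold layer respectively.

```python
from decimal import Decimal


def parity_failures(legacy_totals, new_totals, tolerance=Decimal("0.00")):
    """Return {company: difference} for every company outside the tolerance.

    A value of None marks a company that exists on only one side.
    """
    failures = {}
    for company in sorted(set(legacy_totals) | set(new_totals)):
        legacy = legacy_totals.get(company)
        new = new_totals.get(company)
        if legacy is None or new is None:
            failures[company] = None
        elif abs(new - legacy) > tolerance:
            failures[company] = new - legacy
    return failures


# Made-up totals for illustration only.
legacy = {"company1": Decimal("100.00"), "company2": Decimal("50.00")}
new = {"company1": Decimal("100.00"), "company2": Decimal("52.00")}
print(parity_failures(legacy, new))
```

Using `Decimal` rather than floats keeps the comparison exact for currency amounts, so a tolerance of zero really means 100% parity.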

Data engineering is rarely about writing code that works perfectly on the first run; it’s about refactoring for scale and hunting down hidden business logic.

Has anyone else faced a massive data discrepancy during a migration?

Starting a Migration: Shifting from a Legacy Data System to a Modern Data Stack

Matheus Dallacort — Mon, 08 Jun 2026 14:41:57 +0000

Hello, DEV community!

I’m currently working as a developer/engineer, and our data architecture relies heavily on legacy structures (mostly spreadsheets and Qlik). While it served its purpose for a time, we’ve hit a wall. It’s hard to scale, maintenance is becoming a headache, and processing times are slowing us down.

To solve this, I’m kicking off a 3-month project to migrate this whole infrastructure to a Modern Data Stack. My goal is to build a reliable, low-latency, and scalable analytical pipeline.

The Target Stack

Ingestion/Extraction: Custom Python scripts (choosing code-first over no-code tools to maintain full control over payload manipulation, error handling, and performance).
Orchestration: Apache Airflow (for scheduling and monitoring our ingestion DAGs).
Data Warehouse: ClickHouse (leveraging its columnar storage for fast analytical queries).
Transformation: dbt, the data build tool (to handle data modeling and testing directly inside the warehouse).

The Repository Structure

I spent some time structuring the project repository to ensure clean code practices from day one. Here is how I organized it:

extract/: Dedicated Python scripts for our data ingestion logic.
dbt/: For data models, macros, and schema tests.
orchestration/: Where the Airflow pipeline logic will live.
sql/: DDL initialization scripts for the warehouse setup.

I also included a CUTOVER.md file because planning how to safely switch off the legacy system is just as important as building the new one.

Why am I documenting this?

I'm writing this series as a public diary for two reasons:

To document my technical journey, challenges, and architectural decisions.
To practice explaining engineering concepts in English and connect with other data folks worldwide.

Next step: Setting up the local environment via Docker and writing the first custom Python extraction scripts.

If you have any tips on orchestrating Python ingestion scripts via Airflow into ClickHouse, let me know in the comments! Let's build.

DEV Community: Matheus Dallacort