<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Matheus Dallacort</title>
    <description>The latest articles on DEV Community by Matheus Dallacort (@dalla).</description>
    <link>https://dev.to/dalla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3971409%2Fcb3119d6-1dbc-4f86-974a-93daf166abf0.png</url>
      <title>DEV Community: Matheus Dallacort</title>
      <link>https://dev.to/dalla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dalla"/>
    <language>en</language>
    <item>
      <title>Modern Data Stack Migration — Day 1: Scaling to 8+ Companies with DRY Architecture and Chasing a $2M Discrepancy</title>
      <dc:creator>Matheus Dallacort</dc:creator>
      <pubDate>Wed, 10 Jun 2026 12:41:17 +0000</pubDate>
      <link>https://dev.to/dalla/modern-data-stack-migration-day-1-scaling-to-8-companies-with-dry-architecture-and-chasing-a-34e2</link>
      <guid>https://dev.to/dalla/modern-data-stack-migration-day-1-scaling-to-8-companies-with-dry-architecture-and-chasing-a-34e2</guid>
      <description>&lt;p&gt;Hello everyone! Following up on my &lt;a href="https://dev.to/matheus_dallacort_9c05897/starting-a-migration-shifting-from-a-legacy-data-system-to-a-modern-data-stack-1m4f"&gt;previous post&lt;/a&gt;, Day 1 of my Modern Data Stack migration was an absolute rollercoaster of refactoring and deep data auditing. &lt;/p&gt;

&lt;p&gt;I’m moving our legacy system (spreadsheets and Qlik) into a robust pipeline using &lt;strong&gt;Python, ClickHouse, and dbt&lt;/strong&gt;. Here is what went down over the last 24 hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. From Messy Scripts to a Single, Parameterized Extraction Engine 🛠️
&lt;/h3&gt;

&lt;p&gt;In the legacy setup, each company had its own folder, its own &lt;code&gt;.env&lt;/code&gt; file, and its own duplicated Python extraction script. It was a maintenance nightmare.&lt;/p&gt;

&lt;p&gt;Yesterday, I completely refactored this structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Configuration:&lt;/strong&gt; Merged all separate environments into a single, global &lt;code&gt;.env&lt;/code&gt; file at the root level, mapping all 8+ companies and their branches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eliminated Code Duplication (DRY):&lt;/strong&gt; Instead of having identical extraction logic copied across folders, I built a single, unified codebase. Now, we have &lt;strong&gt;one universal script for Sales, one for Stock, one for Orders, etc.&lt;/strong&gt; The behavior changes dynamically based on the company argument we pass to the CLI (e.g., &lt;code&gt;python -m extract.run extract --source company1&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
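
&lt;p&gt;To make the "one script, many companies" idea concrete, here is a minimal sketch of how such a CLI could be wired with &lt;code&gt;argparse&lt;/code&gt;. It is not the project's actual code: the &lt;code&gt;SOURCES&lt;/code&gt; registry, the branch codes, the &lt;code&gt;--dataset&lt;/code&gt; flag, and the &lt;code&gt;plan_jobs&lt;/code&gt; helper are all hypothetical names used for illustration. Only the &lt;code&gt;extract --source company1&lt;/code&gt; shape comes from the command above.&lt;/p&gt;

```python
import argparse

# Hypothetical registry: one entry per company. In the real project this
# mapping would presumably be loaded from the global .env file.
SOURCES = {
    "company1": {"branches": ["01", "02"]},
    "company2": {"branches": ["01"]},
}

# Hypothetical dataset names, one shared extractor per dataset.
DATASETS = ("sales", "stock", "orders")


def build_parser():
    """Build a parser that accepts: extract --source NAME [--dataset NAME]."""
    parser = argparse.ArgumentParser(prog="extract.run")
    subcommands = parser.add_subparsers(dest="command", required=True)
    extract = subcommands.add_parser("extract", help="run the shared extractors")
    extract.add_argument("--source", required=True, choices=sorted(SOURCES))
    extract.add_argument("--dataset", choices=DATASETS, default=None)
    return parser


def plan_jobs(source: str, dataset=None):
    """Return the (source, branch, dataset) jobs the shared code would run."""
    datasets = DATASETS if dataset is None else (dataset,)
    return [
        (source, branch, name)
        for branch in SOURCES[source]["branches"]
        for name in datasets
    ]


def main(argv=None):
    """Parse the CLI arguments and list the planned jobs (no real extraction)."""
    args = build_parser().parse_args(argv)
    jobs = plan_jobs(args.source, args.dataset)
    for job in jobs:
        print("would extract", *job)
    return jobs
```

&lt;p&gt;In this sketch, onboarding a new company means adding one registry entry rather than copying a folder. An unknown &lt;code&gt;--source&lt;/code&gt; value is rejected by &lt;code&gt;argparse&lt;/code&gt; before any extraction logic runs.&lt;/p&gt;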

&lt;p&gt;To speed up this refactoring, I used Claude to generate the initial application skeleton. Since the AI already had the context of our legacy extraction logic, translating it into this new clean architecture was incredibly smooth.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Highs and Lows: The Data Parity Challenge
&lt;/h3&gt;

&lt;p&gt;With the pipeline modernized, I ran the pilot ingestion for &lt;strong&gt;Company #1&lt;/strong&gt;. To minimize friction for our downstream BI consumers, I kept the ClickHouse Bronze tables structured 1:1 with the legacy CSV schemas. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Good News:&lt;/strong&gt; The data ingestion into the &lt;strong&gt;Bronze&lt;/strong&gt; layer worked flawlessly. Moving up to the &lt;strong&gt;Silver&lt;/strong&gt; layer (where we do data cleaning and domain-specific transformations), everything validated beautifully. Row counts matched perfectly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Fun" Part (The $2 Million Gap):&lt;/strong&gt; When I materialized the &lt;strong&gt;Gold&lt;/strong&gt; layer (our consolidated group business models), I hit a massive wall. The new pipeline reported &lt;strong&gt;USD 2 million more in revenue&lt;/strong&gt; than the legacy system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why is there an inconsistency?
&lt;/h3&gt;

&lt;p&gt;Engineering notes show an overcount in sales invoices. In data engineering, a difference this large usually means one thing: &lt;strong&gt;undocumented legacy business rules&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Right now, I'm auditing our dbt macros and transformation models. There is a high chance that the legacy system applies specific multi-company exclusions, cancellation filters, or tax logic that were never documented in the initial migration scope.&lt;/p&gt;
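
&lt;p&gt;One way to narrow down a gap like this is to compare totals per invoice type between the legacy export and the new Gold output. The snippet below is only a sketch under assumptions: the &lt;code&gt;(invoice_type, amount)&lt;/code&gt; row shape, the helper names, and the sample rows are made up for illustration and are not figures from this migration.&lt;/p&gt;

```python
from collections import defaultdict
from decimal import Decimal


def totals_by_type(rows):
    """Sum invoice amounts per invoice type."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for invoice_type, amount in rows:
        totals[invoice_type] += Decimal(amount)
    return dict(totals)


def reconcile(legacy_rows, new_rows):
    """Return {invoice_type: new_total - legacy_total}, keeping only real gaps."""
    legacy = totals_by_type(legacy_rows)
    new = totals_by_type(new_rows)
    gaps = {}
    for invoice_type in sorted(set(legacy) | set(new)):
        diff = new.get(invoice_type, Decimal(0)) - legacy.get(invoice_type, Decimal(0))
        if diff != 0:
            gaps[invoice_type] = diff
    return gaps


# Made-up sample rows, NOT real figures from the migration.
legacy_rows = [("sale", "100.00"), ("sale", "250.00")]
new_rows = [("sale", "100.00"), ("sale", "250.00"), ("cancelled", "75.00")]

print(reconcile(legacy_rows, new_rows))
```

&lt;p&gt;With the sample rows, the only gap is the hypothetical &lt;code&gt;cancelled&lt;/code&gt; type, which is the kind of signal that would point to a missing cancellation filter. Using &lt;code&gt;Decimal&lt;/code&gt; instead of floats keeps the comparison exact for currency amounts.&lt;/p&gt;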

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit the Gold layer rules:&lt;/strong&gt; Write strict dbt tests to isolate exactly which invoice types are causing the inflation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix the business logic:&lt;/strong&gt; Align the multi-company macro constraints until we hit 100% data parity for Company #1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale:&lt;/strong&gt; Once the rule engine is bulletproof, start onboarding the remaining 7+ companies using our new centralized pipeline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Data engineering is rarely about writing code that works perfectly on the first run; it’s about refactoring for scale and hunting down hidden business logic. &lt;/p&gt;

&lt;p&gt;Has anyone else faced a massive data discrepancy during a migration?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>dbt</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Starting a Migration: Shifting from a Legacy Data System to a Modern Data Stack</title>
      <dc:creator>Matheus Dallacort</dc:creator>
      <pubDate>Mon, 08 Jun 2026 14:41:57 +0000</pubDate>
      <link>https://dev.to/dalla/starting-a-migration-shifting-from-a-legacy-data-system-to-a-modern-data-stack-1m4f</link>
      <guid>https://dev.to/dalla/starting-a-migration-shifting-from-a-legacy-data-system-to-a-modern-data-stack-1m4f</guid>
      <description>&lt;p&gt;Hello, DEV community! &lt;/p&gt;

&lt;p&gt;I’m currently working as a developer/engineer, and our data architecture relies heavily on legacy structures (mostly spreadsheets and Qlik). While it served its purpose for a time, we’ve hit a wall. It’s hard to scale, maintenance is becoming a headache, and processing times are slowing us down.&lt;/p&gt;

&lt;p&gt;To solve this, I’m kicking off a 3-month project to migrate this whole infrastructure to a &lt;strong&gt;Modern Data Stack&lt;/strong&gt;. My goal is to build a reliable, low-latency, and scalable analytical pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Target Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion/Extraction:&lt;/strong&gt; Custom Python scripts (choosing code-first over no-code tools to maintain full control over payload manipulation, error handling, and performance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration:&lt;/strong&gt; Apache Airflow (for scheduling and monitoring our ingestion DAGs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse:&lt;/strong&gt; ClickHouse (leveraging its columnar storage for fast analytical queries).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformation:&lt;/strong&gt; dbt, the data build tool (to handle data modeling and testing directly inside the warehouse).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Repository Structure
&lt;/h3&gt;

&lt;p&gt;I spent some time structuring the project repository to ensure clean code practices from day one. Here is how I organized it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;extract/&lt;/code&gt;: Dedicated Python scripts for our data ingestion logic.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbt/&lt;/code&gt;: For data models, macros, and schema tests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orchestration/&lt;/code&gt;: Where the Airflow pipeline logic will live.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sql/&lt;/code&gt;: DDL initialization scripts for the warehouse setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also included a &lt;code&gt;CUTOVER.md&lt;/code&gt; file because planning how to safely switch off the legacy system is just as important as building the new one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why am I documenting this?
&lt;/h3&gt;

&lt;p&gt;I'm writing this series as a public diary for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;To document my technical journey, challenges, and architectural decisions.&lt;/li&gt;
&lt;li&gt;To practice explaining engineering concepts in English and connect with other data folks worldwide.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next step: Setting up the local environment via Docker and writing the first custom Python extraction scripts. &lt;/p&gt;

&lt;p&gt;If you have any tips on orchestrating Python ingestion scripts via Airflow into ClickHouse, let me know in the comments! Let's build. &lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>architecture</category>
      <category>learning</category>
      <category>database</category>
    </item>
  </channel>
</rss>
