Hello, DEV community!
I’m currently working as a developer/engineer, and our data architecture relies heavily on legacy structures (mostly spreadsheets and Qlik). While it served its purpose for a time, we’ve hit a wall. It’s hard to scale, maintenance is becoming a headache, and processing times are slowing us down.
To solve this, I’m kicking off a 3-month project to migrate this whole infrastructure to a Modern Data Stack. My goal is to build a reliable, low-latency, and scalable analytical pipeline.
The Target Stack
- Ingestion/Extraction: Custom Python scripts (choosing code-first over no-code tools to maintain full control over payload manipulation, error handling, and performance).
- Orchestration: Apache Airflow (for scheduling and monitoring our ingestion DAGs).
- Data Warehouse: ClickHouse (leveraging its columnar storage for fast analytical queries over large tables).
- Transformation: dbt, the data build tool (to handle data modeling and testing directly inside the warehouse).
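To make the "code-first" argument a bit more concrete, here is a minimal sketch of the kind of control I mean for error handling and payload manipulation. It uses only the standard library. The names `extract_with_retry` and `normalize` are hypothetical placeholders, not code from the actual project, and the retry policy (exponential backoff, three attempts) is an assumption.

```python
import time
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def extract_with_retry(
    fetch: Callable[[], List[Record]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Record]:
    """Call fetch(), retrying with exponential backoff when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller (e.g. a scheduler)
                # can mark the run as failed.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def normalize(record: Record) -> Record:
    """Lowercase and strip keys; trim whitespace in string values."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }
```

In a real pipeline, `fetch` would wrap whatever reads the source (a spreadsheet export, an API call), and the normalized rows would then be handed to the warehouse loader. Passing `sleep` as a parameter is only there to make the function easy to test without waiting.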
The Repository Structure
I spent some time structuring the project repository to ensure clean code practices from day one. Here is how I organized it:
- extract/: Dedicated Python scripts for our data ingestion logic.
- dbt/: For data models, macros, and schema tests.
- orchestration/: Where the Airflow pipeline logic will live.
- sql/: DDL initialization scripts for the warehouse setup.
I also included a CUTOVER.md file because planning how to safely switch off the legacy system is just as important as building the new one.
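If you want to reproduce this layout, a small script can create the skeleton. This is just a sketch: the folder names and CUTOVER.md come from the structure described above, while the `scaffold` helper and the `PROJECT_DIRS` constant are hypothetical names I am using for illustration.

```python
from pathlib import Path

# Top-level folders from the structure described in this post.
PROJECT_DIRS = ("extract", "dbt", "orchestration", "sql")


def scaffold(root: Path) -> None:
    """Create the folder skeleton and an empty CUTOVER.md if they are missing."""
    for name in PROJECT_DIRS:
        # exist_ok=True makes the script safe to re-run on an existing repo.
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / "CUTOVER.md").touch(exist_ok=True)
```

Calling `scaffold(Path("."))` from the repository root would create any missing pieces and leave existing files untouched.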
Why am I documenting this?
I'm writing this series as a public diary for two reasons:
- To document my technical journey, challenges, and architectural decisions.
- To practice explaining engineering concepts in English and connect with other data folks worldwide.
Next step: Setting up the local environment via Docker and writing the first custom Python extraction scripts.
If you have any tips on orchestrating Python ingestion scripts via Airflow into ClickHouse, let me know in the comments! Let's build.