<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pizofreude</title>
    <description>The latest articles on DEV Community by Pizofreude (@pizofreude).</description>
    <link>https://dev.to/pizofreude</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F971308%2F0a84c6c6-2fcf-470a-9050-d90c267dfccb.png</url>
      <title>DEV Community: Pizofreude</title>
      <link>https://dev.to/pizofreude</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pizofreude"/>
    <language>en</language>
    <item>
      <title>dlt MCP Server for Popular IDEs</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Wed, 18 Feb 2026 14:42:17 +0000</pubDate>
      <link>https://dev.to/pizofreude/dlt-mcp-server-for-popular-ides-2e9b</link>
      <guid>https://dev.to/pizofreude/dlt-mcp-server-for-popular-ides-2e9b</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This demo showcases how to set up and use the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; for data pipeline validation and inspection. The MCP server enables interactive querying and management of dlt pipelines, including data inspection, row counts, and load validation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UV&lt;/strong&gt; installed on your local machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dlt workspace&lt;/strong&gt; installed and configured.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;pyproject.toml&lt;/code&gt; file with the necessary dependencies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Configure MCP Server
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;VS Code&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;VS Code&lt;/strong&gt; and access &lt;strong&gt;Settings&lt;/strong&gt; (&lt;code&gt;Command+Shift+P&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Tools &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Ensure you include the &lt;code&gt;duckdb&lt;/code&gt; dependency if using a DuckDB destination.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;
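&lt;p&gt;For reference, a complete &lt;code&gt;mcp.json&lt;/code&gt; usually nests each server entry under a top-level key. Below is a minimal sketch, assuming VS Code's &lt;code&gt;servers&lt;/code&gt; wrapper (other clients typically use &lt;code&gt;mcpServers&lt;/code&gt;) and keeping the placeholder command and the &lt;code&gt;duckdb&lt;/code&gt; argument from the snippet above; adapt both to your own setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "servers": {
    "dlt-mcp-server": {
      "command": "dlt pipeline ...",
      "args": ["--with", "duckdb"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;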

&lt;h4&gt;
  
  
  &lt;strong&gt;Cursor&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Cursor&lt;/strong&gt; and access &lt;strong&gt;Settings&lt;/strong&gt; (&lt;code&gt;Command+,&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Extensions &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Kiro&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Kiro&lt;/strong&gt; and access &lt;strong&gt;Preferences&lt;/strong&gt; (&lt;code&gt;Command+,&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Plugins &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Claude Desktop&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Claude Desktop&lt;/strong&gt; and access &lt;strong&gt;Settings&lt;/strong&gt; (&lt;code&gt;Command+,&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Integrations &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Other IDEs (e.g., PyCharm, IntelliJ, Sublime Text)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Locate the &lt;strong&gt;MCP configuration&lt;/strong&gt; section in your IDE's settings.&lt;/li&gt;
&lt;li&gt;Create or open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Test MCP Server
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open a chat in your IDE and ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  What pipelines are available?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The MCP server should list the available pipelines (e.g., GitHub pipeline).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Inspect Pipeline Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  What tables are in this pipeline?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The server will list tables (e.g., &lt;code&gt;commits&lt;/code&gt;, &lt;code&gt;contributors&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  When was the data last loaded?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The server will provide the timestamp of the last data load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Validate Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  How many rows are in the commits table?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the MCP server lacks dependencies (e.g., &lt;code&gt;duckdb&lt;/code&gt;), it will throw an error. Update the &lt;code&gt;mcp.json&lt;/code&gt; configuration to include the missing dependency and retry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Agentic Help
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  How many rows will be extracted in the next run in commits?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The MCP server will analyze the pipeline and confirm whether incremental loading is applied. If it is not, the next run will re-extract all existing rows plus any new data since the last run.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Errors&lt;/strong&gt;: Ensure all required dependencies (e.g., &lt;code&gt;duckdb&lt;/code&gt;) are included in the &lt;code&gt;mcp.json&lt;/code&gt; configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Updates&lt;/strong&gt;: After modifying &lt;code&gt;mcp.json&lt;/code&gt;, wait a few seconds for the MCP server to apply changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE-Specific Issues&lt;/strong&gt;: Refer to your IDE's documentation for MCP-related troubleshooting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;dlt MCP Server&lt;/strong&gt; simplifies pipeline management by enabling interactive data inspection and validation. Customize the &lt;code&gt;mcp.json&lt;/code&gt; configuration to support your specific pipeline destinations and dependencies.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>vscode</category>
      <category>cursor</category>
    </item>
    <item>
      <title>Peer Review 3: France Data Engineering Job Market Transformations, Visualization, and Feedback (Part 2)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Fri, 02 May 2025 13:42:31 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-transformations-visualization-and-feedback-3ahl</link>
      <guid>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-transformations-visualization-and-feedback-3ahl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome back to the final part of the peer review of the &lt;a href="https://github.com/aafaf655/DE-Job-Market-Analysis" rel="noopener noreferrer"&gt;France Data Engineering Job Market Analysis pipeline&lt;/a&gt;. In &lt;a href="https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-analysis-pipeline-infra-part-1-2ei1"&gt;Part 1&lt;/a&gt;, we explored the project’s infrastructure, cloud setup, and orchestration. Now, we’ll go deeper into the heart of the data platform: transformations, data warehouse design, dashboarding, reproducibility, and actionable feedback.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Transformations with dbt
&lt;/h2&gt;

&lt;p&gt;Modern data engineering pipelines are built on modular, testable transformations—and dbt (Data Build Tool) shines in this space. This project structures its dbt codebase into &lt;strong&gt;staging&lt;/strong&gt;, &lt;strong&gt;core&lt;/strong&gt;, and &lt;strong&gt;marts&lt;/strong&gt; layers, following best practices for maintainability and scalability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Staging Models:&lt;/strong&gt; Clean and standardize raw job posting data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Models:&lt;/strong&gt; Build core analytical tables, e.g., &lt;code&gt;fact_jobs&lt;/code&gt;, &lt;code&gt;dim_company&lt;/code&gt;, &lt;code&gt;dim_skills&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marts Models:&lt;/strong&gt; Deliver analytics-ready tables for direct dashboard consumption, e.g., top skills, salary distribution, remote job trends.&lt;/li&gt;
&lt;/ul&gt;
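&lt;p&gt;To make the layering concrete, here is a minimal sketch of what a staging model in this structure might look like. The source and column names are hypothetical and not taken from the project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- models/staging/stg_job_postings.sql (hypothetical names)
select
    cast(job_id as string)      as job_id,
    lower(trim(job_title))      as job_title,
    lower(trim(company_name))   as company_name,
    cast(posted_date as date)   as posted_date
from {{ source('raw', 'job_postings') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;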

&lt;p&gt;&lt;strong&gt;Integration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;dbt transformations are automated via Kestra, ensuring that new data is transformed and ready for analytics on a regular schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent use of dbt’s modular structure. The pipeline ensures all transformations are reproducible, testable, and production-ready. For further robustness, consider populating the &lt;code&gt;tests/&lt;/code&gt; and &lt;code&gt;macros/&lt;/code&gt; folders with custom tests and logic.&lt;/p&gt;
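&lt;p&gt;As an illustration of what could live in &lt;code&gt;tests/&lt;/code&gt;, a dbt singular test is simply a SQL file that must return zero rows to pass. A minimal sketch, using the project's &lt;code&gt;fact_jobs&lt;/code&gt; table and a hypothetical &lt;code&gt;job_id&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- tests/assert_fact_jobs_has_no_null_ids.sql (hypothetical)
-- The test fails if this query returns any rows.
select *
from {{ ref('fact_jobs') }}
where job_id is null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;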




&lt;h2&gt;
  
  
  2. Data Warehouse Design
&lt;/h2&gt;

&lt;p&gt;The project leverages &lt;strong&gt;Google BigQuery&lt;/strong&gt; as the data warehouse, which is a solid choice for scalable analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External Tables:&lt;/strong&gt; Raw CSVs in GCS are registered as external tables in BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Tables &amp;amp; Marts:&lt;/strong&gt; Transformed data is materialized as native tables and views for efficient querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Partitioning &amp;amp; Clustering:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the project’s structure suggests a thoughtful separation between staging and marts, there isn’t explicit documentation of table partitioning or clustering strategies. These can make a big difference in query performance and cost efficiency at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good warehouse design with clear separation of concerns. Adding documentation and rationale for partitioning and clustering would further strengthen the warehouse layer.&lt;/p&gt;
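&lt;p&gt;For example, with dbt on BigQuery, partitioning and clustering can be declared directly in a model's config block. A minimal sketch with hypothetical column names, referencing the staging model sketched earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;{{ config(
    materialized='table',
    partition_by={'field': 'posted_date', 'data_type': 'date'},
    cluster_by=['company_name', 'location']
) }}

select * from {{ ref('stg_job_postings') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;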




&lt;h2&gt;
  
  
  3. Dashboarding &amp;amp; Data Products
&lt;/h2&gt;

&lt;p&gt;The final data product is a &lt;strong&gt;Power BI dashboard&lt;/strong&gt; that visualizes key insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tiles include:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Top skills in demand&lt;/li&gt;
&lt;li&gt;Salary distribution&lt;/li&gt;
&lt;li&gt;Remote work trends&lt;/li&gt;
&lt;li&gt;Company performance and job trends over time&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The dashboard is visually clear (see screenshots in the &lt;code&gt;images/&lt;/code&gt; folder) and directly queries marts tables in BigQuery, ensuring up-to-date insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qyagojjhzci79crxw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qyagojjhzci79crxw4.png" alt="Dashboard" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strong dashboard implementation. Multiple analytical tiles provide different perspectives for stakeholders, and the visuals are easy to interpret.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Reproducibility &amp;amp; Documentation
&lt;/h2&gt;

&lt;p&gt;Reproducibility is a cornerstone of engineering excellence. This project excels in that area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://readme.md/" rel="noopener noreferrer"&gt;&lt;strong&gt;README.md&lt;/strong&gt;&lt;/a&gt; includes step-by-step instructions for everything—infra setup, data ingestion, dbt transformations, and dashboard connection.&lt;/li&gt;
&lt;li&gt;Sample config variables are provided, and the logical flow is easy to follow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clear, actionable documentation makes this project easy to run and adapt. Excellent work!&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Actionable Feedback &amp;amp; Areas for Growth
&lt;/h2&gt;

&lt;p&gt;Even great projects have room to grow! Here are some opportunities for further improvement and learning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse Optimization:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Explicitly document partitioning and clustering strategies in BigQuery marts. Explain how these optimize for cost and performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing &amp;amp; CI/CD:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Add dbt tests for data quality (e.g., uniqueness, null checks) and consider adding pipeline-level validation.&lt;/li&gt;
&lt;li&gt;Explore integrating CI/CD (e.g., GitHub Actions) for automated testing and deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Transparency:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Include diagrams or screenshots of Kestra flows in the documentation for better orchestration visibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Ingestion (Optional):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;If real-time job data becomes available, consider building a streaming ingestion pipeline to expand the project’s scope.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage dbt Advanced Features:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Utilize the empty &lt;code&gt;macros/&lt;/code&gt;, &lt;code&gt;tests/&lt;/code&gt;, and &lt;code&gt;snapshots/&lt;/code&gt; directories for more advanced dbt features, such as custom logic or snapshotting slowly changing dimensions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Reviewing and learning from real-world projects is one of the best ways to grow as a data engineer. This project is a fantastic example of a modern, cloud-native data engineering pipeline—well-documented, automated, and designed for actionable analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modular, testable transformations with dbt are the backbone of maintainable analytics pipelines.&lt;/li&gt;
&lt;li&gt;Clear separation between raw, staging, and marts layers makes analytics scalable and robust.&lt;/li&gt;
&lt;li&gt;Visualization is more than pretty charts—it’s about surfacing real insights for stakeholders.&lt;/li&gt;
&lt;li&gt;Great documentation is as important as great code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you enjoyed this review, try evaluating an open-source project yourself and utilize the learning opportunities provided by &lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;Data Talks Club&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>cloud</category>
      <category>learning</category>
    </item>
    <item>
      <title>Peer Review 3: France Data Engineering Job Market Analysis Pipeline Infra (Part 1)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Fri, 02 May 2025 13:30:01 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-analysis-pipeline-infra-part-1-2ei1</link>
      <guid>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-analysis-pipeline-infra-part-1-2ei1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the third peer review series for &lt;a href="https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html#data-engineering-zoomcamp" rel="noopener noreferrer"&gt;DataTalks Club Data Engineering Zoomcamp&lt;/a&gt;. In this post, I’ll be dissecting a real-world data engineering project that analyzes the &lt;a href="https://github.com/aafaf655/DE-Job-Market-Analysis" rel="noopener noreferrer"&gt;French Data Engineering job market&lt;/a&gt;. The goal? To break down the project’s infrastructure, orchestration, and cloud design—spotlighting what works well, what could be improved, and, most importantly, what we can all learn as practicing data engineers.&lt;/p&gt;

&lt;p&gt;Why do this? Because reviewing and sharing feedback on real-world projects sharpens our own skills, encourages open knowledge sharing, and helps us all grow together. Let’s dig in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Project:&lt;/strong&gt; &lt;a href="https://github.com/aafaf655/DE-Job-Market-Analysis" rel="noopener noreferrer"&gt;DE-Job-Market-Analysis (GitHub)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Build an end-to-end, cloud-native pipeline to collect, store, transform, and visualize Data Engineering job postings for the French market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key questions addressed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the demand for Data Engineering roles in France?&lt;/li&gt;
&lt;li&gt;Which skills and tools are most sought after?&lt;/li&gt;
&lt;li&gt;Which companies are hiring, and what are their workforce sizes?&lt;/li&gt;
&lt;li&gt;What are the salary trends and geographic patterns?&lt;/li&gt;
&lt;li&gt;How do job posting trends evolve over time?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Evaluation Criteria: How I Review
&lt;/h2&gt;

&lt;p&gt;For this peer review series, I use a structured rubric inspired by the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/projects/README.md" rel="noopener noreferrer"&gt;DataTalksClub Data Engineering Zoomcamp&lt;/a&gt; project guidelines. The main areas of focus are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Problem Description&lt;/li&gt;
&lt;li&gt;Cloud Infrastructure (and Infrastructure as Code)&lt;/li&gt;
&lt;li&gt;Data Ingestion &amp;amp; Orchestration&lt;/li&gt;
&lt;li&gt;(Part 2: Transformations, Dashboarding, Reproducibility, and Actionable Feedback)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Problem Description
&lt;/h2&gt;

&lt;p&gt;Right from the start, the project’s README does an excellent job of motivating the work. It clearly explains &lt;strong&gt;why&lt;/strong&gt; understanding the French Data Engineering job market matters, lays out the business context, and lists the specific insights the pipeline aims to deliver.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The project aims to provide valuable insights into the demand for Data Engineering roles, most sought-after skills, key hiring companies, salary trends, locations, and job posting trends over time.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent articulation! The clarity of context and objectives makes it easy for any reader (technical or not) to quickly understand the project’s purpose and value.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Cloud Infrastructure &amp;amp; IaC
&lt;/h2&gt;

&lt;p&gt;This project is cloud-native, leveraging &lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt; as the backbone.&lt;/p&gt;

&lt;p&gt;Key services used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery:&lt;/strong&gt; The analytical data warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Storage (GCS):&lt;/strong&gt; For raw and processed data storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform:&lt;/strong&gt; Infrastructure as Code for reproducible, automated cloud resource provisioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Stands Out
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The use of Terraform (with &lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt;) to provision GCS buckets, BigQuery datasets, and service accounts is a mark of maturity—no click-ops here!&lt;/li&gt;
&lt;li&gt;The README provides step-by-step instructions for configuring GCP variables, applying Terraform, and setting up service accounts.&lt;/li&gt;
&lt;li&gt;The infrastructure is cleanly separated into its own directory, making the project modular and easy to maintain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strong implementation of cloud and IaC best practices. The use of Terraform for GCP infra shows a solid grasp of production-grade deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Workflow Orchestration: Batch Pipelines with Kestra
&lt;/h2&gt;

&lt;p&gt;The pipeline’s automation is orchestrated using &lt;strong&gt;Kestra&lt;/strong&gt;, a modern workflow orchestration tool (think: Apache Airflow alternative, but YAML-first and developer-friendly).&lt;/p&gt;

&lt;h3&gt;
  
  
  How It’s Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kestra flows&lt;/strong&gt; automate job posting scraping (using &lt;a href="https://github.com/speedyapply/JobSpy" rel="noopener noreferrer"&gt;JobSpy&lt;/a&gt;), data uploads to GCS, and the triggering of dbt transformations.&lt;/li&gt;
&lt;li&gt;The orchestration logic is defined in YAML files, located in a dedicated &lt;code&gt;kestra/&lt;/code&gt; directory.&lt;/li&gt;
&lt;li&gt;The workflow covers end-to-end batch scheduling: daily scraping, loading, and transformation, ensuring up-to-date analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Batch vs. Streaming
&lt;/h3&gt;

&lt;p&gt;This project focuses exclusively on &lt;strong&gt;batch processing&lt;/strong&gt;—scraping and updating the dataset on a periodic schedule. There’s no streaming ingestion (like Kafka), which is appropriate for the type of data source used here (static job listings).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Great use of Kestra for orchestrating a robust, modular DAG. For future iterations, consider adding diagrams or screenshots of the Kestra flows to make the orchestration even clearer for newcomers.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Data Ingestion: Batch (and What About Streaming?)
&lt;/h2&gt;

&lt;p&gt;The ingestion process is classic batch ETL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scraping&lt;/strong&gt;: Job postings are scraped using an external tool and saved as CSV.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loading&lt;/strong&gt;: CSVs are uploaded to GCS and registered as external tables in BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; All steps are orchestrated via Kestra.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why not streaming?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data source (job boards) doesn’t support real-time feeds, so batch scraping is pragmatic. If live job posting APIs were ever available, a streaming pipeline could be an exciting next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The batch pipeline is well-automated and fit-for-purpose. The README makes it easy to understand and reproduce the process.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Contents of Interest (Project Structure Highlights)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;dbt/&lt;/code&gt; Directory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Contains a full dbt project (&lt;code&gt;job_market_analysis&lt;/code&gt;) with:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dbt_project.yml&lt;/code&gt; (project config and structure).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;models/&lt;/code&gt; subdirectory with staging, core, and marts models.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema.yml&lt;/code&gt; for dbt model/table testing and documentation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;macros/&lt;/code&gt;, &lt;code&gt;tests/&lt;/code&gt;, &lt;code&gt;seeds/&lt;/code&gt;, and &lt;code&gt;snapshots/&lt;/code&gt; folders are present but currently empty.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;kestra/&lt;/code&gt; Directory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;YAML flow definitions for workflow orchestration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;terraform/&lt;/code&gt; Directory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt; for GCP infrastructure provisioning.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Used for local orchestration of services.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;images/&lt;/code&gt; Folder&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Includes dashboard screenshots for reference.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Comprehensive, clear, and actionable documentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion &amp;amp; What’s Next
&lt;/h2&gt;

&lt;p&gt;This project demonstrates a strong grasp of modern data engineering infrastructure design: cloud-native, reproducible, and automated. In &lt;strong&gt;Part 2&lt;/strong&gt;, I’ll dive into the transformation layer (dbt), data warehouse design, dashboarding, reproducibility, and provide actionable feedback for the project author—and for all of us as data engineers.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>learning</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Peer Review 2: Data Warehousing, Transformation, and Reproducibility in tfl-data-visualization (Part 2)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Thu, 01 May 2025 16:01:32 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-2-data-warehousing-transformation-and-reproducibility-in-tfl-data-visualization-22pp</link>
      <guid>https://dev.to/pizofreude/peer-review-2-data-warehousing-transformation-and-reproducibility-in-tfl-data-visualization-22pp</guid>
      <description>&lt;p&gt;Welcome to the second part of my peer review series of the &lt;code&gt;tfl-data-visualization&lt;/code&gt; project—a cloud-native data engineering pipeline for &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;analyzing passenger footfall at London Tube and TfL Rail stations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/pizofreude/peer-review-2-tfl-station-footfall-data-analysis-pipeline-part-1-4909"&gt;Part 1&lt;/a&gt;, we explored how the project defines its problem, leverages cloud infrastructure, and orchestrates data ingestion. In this post, we’ll take a closer look at the advanced analytics stages: how the project handles data warehousing, transformation, visualization, and reproducibility. We'll also wrap up with overall feedback and actionable suggestions.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Data Warehouse: BigQuery Partitioning for Scalable Analytics
&lt;/h2&gt;

&lt;p&gt;A robust data warehouse is essential for analytical performance and cost-effectiveness. The project uses &lt;strong&gt;BigQuery&lt;/strong&gt; to store and query processed data. Notably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partitioned Tables:&lt;/strong&gt; The ingestion pipeline consolidates data into native BigQuery tables that are partitioned by travel date. This is a best practice for optimizing query speed and reducing costs in large, time-series datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear Rationale:&lt;/strong&gt; The README explains how and why partitioning is used, making it easy for reviewers and future maintainers to understand the design choices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible Data:&lt;/strong&gt; Both external and native tables are created, supporting flexible exploration and downstream analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent use of partitioning and cloud-native data warehouse features. If further improvements are desired, consider documenting or implementing clustering strategies for even more efficient queries.&lt;/p&gt;
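&lt;p&gt;As a sketch of what that could look like (table and column names here are illustrative, not taken from the repo), clustering can be added alongside partitioning when the native table is created:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE OR REPLACE TABLE `my_project.tfl.station_footfall`
PARTITION BY travel_date
CLUSTER BY station_name
AS
SELECT *
FROM `my_project.tfl.station_footfall_external`;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;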




&lt;h2&gt;
  
  
  5. Transformations with dbt
&lt;/h2&gt;

&lt;p&gt;Transformations are at the core of any data pipeline. This project uses &lt;strong&gt;dbt&lt;/strong&gt; (data build tool) to structure, document, and automate its transformation logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular SQL Models:&lt;/strong&gt; The &lt;code&gt;dbt/models/&lt;/code&gt; directory contains SQL models, with at least one (&lt;code&gt;station_footfall_daily.sql&lt;/code&gt;) handling daily aggregation at the station level (a hypothetical shape is sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Documentation:&lt;/strong&gt; The presence of &lt;code&gt;schema.yml&lt;/code&gt; provides both dbt testing and documentation, ensuring models are validated and well-described.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Execution:&lt;/strong&gt; dbt runs are orchestrated via Kestra, guaranteeing transformations are up-to-date after each ingestion cycle.&lt;/li&gt;
&lt;/ul&gt;
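&lt;p&gt;Since the repository's model is not reproduced here, the sketch below only illustrates the general shape such a daily aggregation might take; all table and column names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- A hypothetical shape for station_footfall_daily.sql
select
    travel_date,
    station_name,
    sum(entry_tap_count)  as total_entries,
    sum(exit_tap_count)   as total_exits
from {{ ref('stg_station_footfall') }}
group by travel_date, station_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;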

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Very strong use of dbt for modular, testable, and automated transformations. The structure supports both maintainability and extensibility. Jinja SQL FTW!&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Dashboarding: Visualizing Insights with Looker Studio
&lt;/h2&gt;

&lt;p&gt;No analytics pipeline is complete without a way to visualize and share insights. The project delivers on this with a &lt;a href="https://lookerstudio.google.com/reporting/33cf406c-c312-4a59-bebd-5d8bf62e0ca6/page/BSfHF" rel="noopener noreferrer"&gt;&lt;strong&gt;Looker Studio dashboard&lt;/strong&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Interactive Tiles:&lt;/strong&gt; The dashboard includes at least a time series chart and a station ranking table, providing multiple perspectives on the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering &amp;amp; Interactivity:&lt;/strong&gt; Users can filter by dimension (date, station, tap type, etc.), making the dashboard useful for different stakeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible Online:&lt;/strong&gt; The dashboard is linked and screenshots are provided in the repository for transparency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kih704d1hsxi0htvtra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kih704d1hsxi0htvtra.png" alt="TfL Dashboard" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent dashboard implementation, with clear, actionable visualizations and interactive filtering that effectively communicate the project’s findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Reproducibility: Clear, Actionable Documentation
&lt;/h2&gt;

&lt;p&gt;Reproducibility is key to collaboration and long-term success. This project excels in this regard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step-by-Step Instructions:&lt;/strong&gt; The README covers everything from cloud credential setup to infrastructure provisioning, orchestration, and transformation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Coverage:&lt;/strong&gt; Both local and cloud-based setups are described, making the project accessible to a wide range of users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of Use:&lt;/strong&gt; With the provided instructions, anyone with appropriate cloud access can reproduce the results without guesswork.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Great job on ensuring reproducibility. The thorough, actionable documentation is a highlight, lowering the barrier for contributors and reviewers alike.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Summary of Feedback and Recommendations
&lt;/h2&gt;

&lt;p&gt;This project stands out for its clarity, modular design, automation, and modern use of cloud and open-source tools. Here’s a recap of what it does well, and where there’s room to grow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear business problem and motivation.&lt;/li&gt;
&lt;li&gt;Fully automated, cloud-native architecture (GCP, Terraform, Kestra).&lt;/li&gt;
&lt;li&gt;Partitioned BigQuery tables for scalable analytics.&lt;/li&gt;
&lt;li&gt;Modular and reproducible transformations with dbt.&lt;/li&gt;
&lt;li&gt;Interactive, insightful dashboard.&lt;/li&gt;
&lt;li&gt;Excellent documentation and reproducibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Areas for Improvement:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Testing and CI/CD:&lt;/strong&gt; Integrate dbt data tests and consider adding continuous integration for pipeline validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring &amp;amp; Alerting:&lt;/strong&gt; Add pipeline monitoring and notification mechanisms for production robustness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration Visualization:&lt;/strong&gt; Include screenshots or diagrams of Kestra flows for enhanced documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Optimization:&lt;/strong&gt; Consider clustering or additional optimization strategies in BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Data:&lt;/strong&gt; If the data source evolves, explore adding real-time streaming capabilities.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;&lt;code&gt;tfl-data-visualization&lt;/code&gt;&lt;/a&gt; project exemplifies the best practices of data engineering: clarity, automation, scalability, and actionable business insights. Peer reviews like this not only celebrate what works but also help teams identify opportunities to make great projects even better. Special thanks &lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;DataTalks Club&lt;/a&gt; for this learning opportunity.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>bigquery</category>
      <category>dbt</category>
    </item>
    <item>
      <title>Peer Review 2: TfL Station Footfall Data Analysis Pipeline (Part 1)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Thu, 01 May 2025 15:50:24 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-2-tfl-station-footfall-data-analysis-pipeline-part-1-4909</link>
      <guid>https://dev.to/pizofreude/peer-review-2-tfl-station-footfall-data-analysis-pipeline-part-1-4909</guid>
      <description>&lt;p&gt;Peer reviews are a cornerstone of building high-quality data engineering projects. They don’t just help catch bugs and inefficiencies—they unlock opportunities for improvement, learning, and robust collaboration. In this two-part series, I’m diving into a peer review of the &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;&lt;code&gt;tfl-data-visualization&lt;/code&gt;&lt;/a&gt; project, which leverages public Transport for London (TfL) Oyster card data to uncover insights about passenger flows across London’s extensive rail network.&lt;/p&gt;

&lt;p&gt;In Part 1, we’ll focus on the project’s foundation: the problem it tackles, its cloud-native architecture, and the orchestration of data ingestion. The goal is to demonstrate how a senior-level data engineering project is structured, documented, and automated for real-world impact.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Overview: What is tfl-data-visualization?
&lt;/h2&gt;

&lt;p&gt;This project is a modern data engineering pipeline designed to analyze footfall data from London Tube and TfL Rail stations. By using &lt;a href="https://crowding.data.tfl.gov.uk/" rel="noopener noreferrer"&gt;open data&lt;/a&gt; on Oyster card tap-ins and tap-outs, the project enables granular analysis of how passengers move through the city’s transport network. The end product is a Looker Studio dashboard powered by data pipelines that automate everything from raw data ingestion to warehouse transformations.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem Description
&lt;/h2&gt;

&lt;p&gt;A well-defined business problem is the first step towards a meaningful solution. This project excels here. The README clearly articulates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The business context:&lt;/strong&gt; Understanding passenger flows can help with optimizing station management, reducing congestion, and supporting infrastructure decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The data source:&lt;/strong&gt; Publicly available TfL Oyster card tap count data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project goals:&lt;/strong&gt; Automate data collection, processing, and visualization to enable data-driven insights for stakeholders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent articulation of the problem and its real-world significance. The clarity helps the reader quickly understand the project’s goals and value.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Cloud Infrastructure and IaC
&lt;/h2&gt;

&lt;p&gt;Modern data engineering projects are built for the cloud, and this project demonstrates that ethos. The pipeline is developed for Google Cloud Platform (GCP), featuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery as the data warehouse:&lt;/strong&gt; Scalable, cost-efficient, and optimized for analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Storage (GCS) for raw data:&lt;/strong&gt; Centralized, secure cloud storage for source files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as Code (IaC) with Terraform:&lt;/strong&gt; All GCP resources are provisioned automatically, ensuring repeatability and minimizing manual setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Outstanding use of cloud technologies and automation. Leveraging Terraform for GCP infra shows strong cloud engineering practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Data Ingestion: Batch Processing and Workflow Orchestration
&lt;/h2&gt;

&lt;p&gt;Automation is at the heart of reliability and scalability. The project uses &lt;a href="https://kestra.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Kestra&lt;/strong&gt;&lt;/a&gt; as its workflow orchestrator, building a robust, automated batch pipeline that covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated downloading&lt;/strong&gt; of multi-year historical CSV data from TfL’s open data portal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uploading files&lt;/strong&gt; to Google Cloud Storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loading data&lt;/strong&gt; into BigQuery, both as external tables and as native tables for partitioned, consolidated analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAG orchestration:&lt;/strong&gt; Kestra flows define the data ingestion DAG, with scheduled weekly updates and modular subflows for maintainability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Great job with fully automated workflow orchestration. Consider including visual examples or screenshots of Kestra flows to further clarify the orchestration structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up Part 1
&lt;/h2&gt;

&lt;p&gt;In this first part, we’ve established the strong foundation on which the &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;&lt;code&gt;tfl-data-visualization&lt;/code&gt;&lt;/a&gt; project is built: a clearly defined problem, cloud-native architecture, and automated data ingestion. These elements are critical for any ambitious data engineering initiative, ensuring not only technical excellence but also business relevance and operational scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stay tuned for Part 2&lt;/strong&gt;, where we’ll dive deeper into data warehousing strategies, transformation logic with dbt, dashboarding, reproducibility, and overall peer review feedback.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>kestra</category>
    </item>
    <item>
      <title>Peer Review 1: Poland's Real Estate Market Dashboards and Insights with Streamlit (Part 2)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Wed, 30 Apr 2025 16:17:37 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-1-polands-real-estate-market-dashboards-and-insights-with-streamlit-part-2-5eah</link>
      <guid>https://dev.to/pizofreude/peer-review-1-polands-real-estate-market-dashboards-and-insights-with-streamlit-part-2-5eah</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the second part of &lt;strong&gt;Peer Review 1&lt;/strong&gt;, where we continue exploring the data engineering project focused on &lt;a href="https://github.com/elgrassa/Data-engineering-professional-certificate/tree/main" rel="noopener noreferrer"&gt;analyzing Poland's real estate market&lt;/a&gt;. In the &lt;a href="https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d"&gt;first post&lt;/a&gt;, we reviewed the problem description, batch data ingestion pipeline, and cloud setup using &lt;strong&gt;Kestra&lt;/strong&gt;, &lt;strong&gt;BigQuery&lt;/strong&gt;, and &lt;strong&gt;dbt Cloud&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive into the &lt;strong&gt;Streamlit dashboard&lt;/strong&gt;, &lt;strong&gt;data transformations&lt;/strong&gt;, and the &lt;strong&gt;insights&lt;/strong&gt; derived from the project. We'll also discuss future improvements and potential optimizations to enhance the project further.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dashboard Implementation: Streamlit at the Forefront
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Streamlit Overview
&lt;/h3&gt;

&lt;p&gt;Streamlit is used to create a &lt;a href="https://polish-flats-ps.streamlit.app/" rel="noopener noreferrer"&gt;static dashboard&lt;/a&gt; that visualizes data trends and insights. The dashboard provides a clear overview of Poland's real estate market, focusing on rental and sales trends across various cities. It includes features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visualizations of market trends:&lt;/strong&gt; Median and 95th percentile prices, city-wise activity, and price distributions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forwp910pg3pe2vp8heam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forwp910pg3pe2vp8heam.png" alt="Dashboard 1" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static data integration:&lt;/strong&gt; Pre-processed CSV files are used to power the dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transition to a Dynamic Dashboard (Planned)
&lt;/h3&gt;

&lt;p&gt;While the current implementation is static, the project owner plans to enhance the dashboard by integrating dynamic and interactive features. This would allow users to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter data by city, price range, and transaction type (rent/sale).&lt;/li&gt;
&lt;li&gt;Interact with visualizations dynamically to explore trends and insights in real-time.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data Transformations with dbt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Transforming Raw Data into Insights
&lt;/h3&gt;

&lt;p&gt;The project uses &lt;strong&gt;dbt Cloud&lt;/strong&gt; to transform raw data into analysis-ready tables. These transformations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning and standardizing raw CSV data.&lt;/li&gt;
&lt;li&gt;Aggregating data by city, transaction type, and time period.&lt;/li&gt;
&lt;li&gt;Calculating metrics like median prices, percentiles, and total listings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example SQL Models
&lt;/h3&gt;

&lt;p&gt;Here’s an example of a dbt model that calculates city-level rental price trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;city_prices&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_listings&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw_data'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;transaction_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rent'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;percentile_95_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_listings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_rental_listings&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;city_prices&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model produces the aggregated, city-level rental metrics that feed the dashboard, keeping the underlying data well-structured and consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Insights from the Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Rental and Sales Trends
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-Activity Cities:&lt;/strong&gt; Warsaw and Kraków consistently show higher rental and sales activity compared to smaller cities like Bydgoszcz and Szczecin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price Distributions:&lt;/strong&gt; Median prices are significantly lower in smaller cities, while the 95th percentile prices indicate luxury market trends in larger cities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bwjd2skvf5mx1ot2yx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bwjd2skvf5mx1ot2yx0.png" alt="Dashboard 2" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Percentile-Based Trends
&lt;/h3&gt;

&lt;p&gt;A line chart in the static dashboard compares median and 95th percentile prices for each city, showing how prices are distributed within each market: the median reflects typical listings, while the 95th percentile captures the upper, luxury end of rental activity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag8fnu6ofo2w2a1s12pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag8fnu6ofo2w2a1s12pf.png" alt="Dashboard 3" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Total Listings vs Average Price&lt;/strong&gt; Comparisons
&lt;/h3&gt;

&lt;p&gt;These charts compare each city's listing activity against its average prices. Average rental prices fluctuate far less than total listings, which are the more volatile metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xlucdzgvmupj3a7qkoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xlucdzgvmupj3a7qkoz.png" alt="Dashboard 4" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Transition to a Dynamic Dashboard
&lt;/h3&gt;

&lt;p&gt;The planned upgrade to an interactive and dynamic dashboard will provide users with real-time filtering and visualization capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dynamic Data Updates
&lt;/h3&gt;

&lt;p&gt;Integrating a streaming data pipeline could enable real-time updates for the dashboard, keeping it current with the latest market data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Advanced Filtering
&lt;/h3&gt;

&lt;p&gt;Adding more advanced filters (e.g., by property type, number of rooms) could enhance the user experience.&lt;/p&gt;
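
&lt;p&gt;As an illustration, a sidebar filter in Streamlit could look roughly like the sketch below. This is only a hypothetical snippet: the CSV export and the &lt;code&gt;property_type&lt;/code&gt; and &lt;code&gt;rooms&lt;/code&gt; column names are assumptions and would need to match the curated tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
import streamlit as st

# Hypothetical export of the curated listings table
df = pd.read_csv("curated_listings.csv")

# Sidebar filters for city, property type, and number of rooms
cities = st.sidebar.multiselect("City", sorted(df["city"].unique()))
property_type = st.sidebar.selectbox("Property type", ["all"] + sorted(df["property_type"].unique()))
rooms = st.sidebar.slider("Rooms", 1, 6, (1, 3))

# Apply the selected filters to the DataFrame shown in the dashboard
filtered = df[df["rooms"].between(rooms[0], rooms[1])]
if cities:
    filtered = filtered[filtered["city"].isin(cities)]
if property_type != "all":
    filtered = filtered[filtered["property_type"] == property_type]

st.dataframe(filtered)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;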

&lt;h3&gt;
  
  
  4. Predictive Analytics
&lt;/h3&gt;

&lt;p&gt;Incorporating time-series forecasting models could provide users with future price trends and market predictions.&lt;/p&gt;
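
&lt;p&gt;For example, a first pass at forecasting median rental prices could fit a simple ARIMA model with statsmodels. This is only a sketch, and the input file and column names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly median rental prices for a single city
series = pd.read_csv(
    "warsaw_monthly_median.csv", parse_dates=["month"], index_col="month"
)["median_price"]

# Fit a simple ARIMA(1, 1, 1) model and forecast the next six months
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;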

&lt;h3&gt;
  
  
  5. Optimization in BigQuery
&lt;/h3&gt;

&lt;p&gt;Partitioning and clustering the BigQuery tables could significantly improve query performance for larger datasets.&lt;/p&gt;
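
&lt;p&gt;For instance, using the google-cloud-bigquery client, a partitioned and clustered table could be created roughly as follows; the project, dataset, and column names are placeholders rather than the project's actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table ID and schema for the listings data
table = bigquery.Table("my-project.real_estate.listings")
table.schema = [
    bigquery.SchemaField("listing_date", "DATE"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("transaction_type", "STRING"),
    bigquery.SchemaField("price", "NUMERIC"),
]

# Partition by date and cluster on the columns most queries filter by
table.time_partitioning = bigquery.TimePartitioning(field="listing_date")
table.clustering_fields = ["city", "transaction_type"]

client.create_table(table, exists_ok=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;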




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post reviewed the current static Streamlit dashboard, data transformations with dbt, and key insights derived from the project. While the static dashboard provides clear visualizations, the planned interactive upgrade will make it more dynamic and user-friendly. Additionally, future improvements like real-time updates and predictive analytics can further enhance the project's impact.&lt;/p&gt;




&lt;h3&gt;
  
  
  Related Posts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d"&gt;Peer Review 1: Analyzing Poland's Real Estate Market (Part 1)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Thanks to &lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;DataTalks.Club&lt;/a&gt; for this learning opportunity.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>streamlit</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Peer Review 1: Analyzing Poland's Real Estate Market (Part 1)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Wed, 30 Apr 2025 15:42:28 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d</link>
      <guid>https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the first part of &lt;strong&gt;Peer Review 1&lt;/strong&gt; for &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp" rel="noopener noreferrer"&gt;DTC DEZOOMCAMP&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt; This two-part series provides an in-depth review of a data engineering pipeline designed to analyze &lt;a href="https://github.com/elgrassa/Data-engineering-professional-certificate/tree/main" rel="noopener noreferrer"&gt;Poland's real estate market&lt;/a&gt;. The project demonstrates the use of modern data engineering tools such as &lt;strong&gt;BigQuery&lt;/strong&gt;, &lt;strong&gt;dbt Cloud&lt;/strong&gt;, and &lt;strong&gt;Kestra&lt;/strong&gt;, along with a &lt;strong&gt;Streamlit&lt;/strong&gt; dashboard for visualization.&lt;/p&gt;

&lt;p&gt;This post will focus on the &lt;strong&gt;problem description&lt;/strong&gt;, &lt;strong&gt;data ingestion pipeline&lt;/strong&gt;, and the &lt;strong&gt;cloud setup&lt;/strong&gt;, while the next post will explore the interactive dashboard and insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem Description
&lt;/h2&gt;

&lt;p&gt;The project aims to analyze Poland's real estate market, focusing on rental and sales trends across various cities. By processing and visualizing the data, the following questions are addressed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Which cities have the highest rental or sales activity?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What are the price trends across different cities?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How does the real estate market vary between rentals and sales?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;a href="https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; from Kaggle, containing apartment prices in Poland, serves as the starting point. This dataset includes details such as city names, transaction types (rent/sale), and prices. The primary challenge lies in transforming the raw CSV data into actionable insights while ensuring scalability and reproducibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Ingestion: Batch Processing with Kestra
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workflow Orchestration
&lt;/h3&gt;

&lt;p&gt;The project employs &lt;strong&gt;Kestra&lt;/strong&gt; for handling multiple CSV files and automating the ETL process. The workflow includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Extraction:&lt;/strong&gt; CSV files containing raw real estate data are ingested into the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transformation:&lt;/strong&gt; Kestra facilitates cleaning and structuring the data for analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Loading:&lt;/strong&gt; The cleaned data is loaded into both &lt;strong&gt;PostgreSQL&lt;/strong&gt; (for local analysis) and &lt;strong&gt;BigQuery&lt;/strong&gt; (for cloud-based analysis).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Kestra?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://kestra.io/" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; provides the ability to automate the entire ETL process, ensuring consistency and minimizing manual intervention. Although the dataset isn’t updated regularly, the pipeline is scalable and can handle new data efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Kestra Flow
&lt;/h3&gt;

&lt;p&gt;An example Kestra flow processes the CSV files by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Taking file paths and metadata (e.g., month and year) as input.&lt;/li&gt;
&lt;li&gt;Executing tasks for data cleaning, validation, and loading (a Python sketch of this step follows the list).&lt;/li&gt;
&lt;li&gt;Producing cleaned data as output in BigQuery and PostgreSQL.&lt;/li&gt;
&lt;/ul&gt;
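
&lt;p&gt;As a rough illustration of what the cleaning-and-loading step does, a Python task along these lines would read a monthly CSV, clean it, and append it to PostgreSQL; the file path, table name, and connection string below are hypothetical and not taken from the reviewed repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

def clean_and_load(csv_path, month, year):
    # Read the raw listings file for the given month
    df = pd.read_csv(csv_path)

    # Basic cleaning: drop rows without a price and normalise city names
    df = df.dropna(subset=["price"])
    df["city"] = df["city"].str.strip().str.title()
    df["month"], df["year"] = month, year

    # Append the cleaned data to PostgreSQL (a similar load targets BigQuery)
    engine = create_engine("postgresql://user:password@localhost:5432/real_estate")
    df.to_sql("raw_data", engine, if_exists="append", index=False)

clean_and_load("apartments_2024_01.csv", month=1, year=2024)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;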




&lt;h2&gt;
  
  
  Cloud Setup: BigQuery and dbt Cloud
&lt;/h2&gt;

&lt;h3&gt;
  
  
  BigQuery as the Data Warehouse
&lt;/h3&gt;

&lt;p&gt;BigQuery serves as the data warehouse for storing and querying the transformed data. Its serverless architecture and scalability make it an excellent choice. Key features utilized include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Queries:&lt;/strong&gt; Used to analyze price distributions, trends, and city-level activity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with dbt Cloud:&lt;/strong&gt; Enables modular and reusable transformations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transformations with dbt Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;dbt Cloud&lt;/strong&gt; is employed for data cleaning and structuring. It allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing modular SQL models.&lt;/li&gt;
&lt;li&gt;Testing data integrity.&lt;/li&gt;
&lt;li&gt;Creating curated tables with calculated fields like medians, percentiles, and trends.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example dbt Configuration
&lt;/h3&gt;

&lt;p&gt;Below is a snippet from the &lt;code&gt;dbt_project.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;polish_flats_dbt'&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0'&lt;/span&gt;
&lt;span class="na"&gt;config-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default'&lt;/span&gt;  &lt;span class="c1"&gt;# Use the default profile from profiles.yml&lt;/span&gt;
&lt;span class="na"&gt;model-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;models&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenges and Workarounds
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Streamlit occasionally failed due to sync delays from the US cluster of dbt Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workaround:&lt;/strong&gt; Pre-exported CSVs were used for local analysis, significantly improving performance and reliability.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;The README file provides detailed instructions for setting up the project locally. These include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting up &lt;strong&gt;PostgreSQL&lt;/strong&gt; and &lt;strong&gt;Kestra&lt;/strong&gt; using Docker.&lt;/li&gt;
&lt;li&gt;Installing dependencies for dbt and running transformations.&lt;/li&gt;
&lt;li&gt;Configuring BigQuery and dbt Cloud for seamless integration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Running Locally
&lt;/h3&gt;

&lt;p&gt;Follow these steps to run the pipeline locally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;https://github.com/elgrassa/Data-engineering-professional-certificate.git&amp;gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;Data-engineering-professional-certificate

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start PostgreSQL and Kestra using Docker:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-p&lt;/span&gt; kestra-postgres up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-bigquery

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post reviewed the problem description, batch data ingestion pipeline with Kestra, and the cloud setup using BigQuery and dbt Cloud. These components form the backbone of the project, enabling efficient ETL processes and scalable storage.&lt;/p&gt;

&lt;p&gt;The next post will delve into the &lt;strong&gt;Streamlit dashboard&lt;/strong&gt;, &lt;strong&gt;visualizations&lt;/strong&gt;, and &lt;strong&gt;insights&lt;/strong&gt; derived from the data.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>kestra</category>
    </item>
    <item>
      <title>InsightFlow Part 9: Workflow Orchestration with Kestra</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 07:31:55 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-9-workflow-orchestration-with-kestra-3cge</link>
      <guid>https://dev.to/pizofreude/insightflow-part-9-workflow-orchestration-with-kestra-3cge</guid>
      <description>&lt;h1&gt;
  
  
  9. Workflow Orchestration with Kestra
&lt;/h1&gt;

&lt;p&gt;In modern data engineering, orchestrating workflows is a critical component of building reliable, scalable, and automated data pipelines. For the &lt;strong&gt;InsightFlow&lt;/strong&gt; project, we leverage &lt;strong&gt;Kestra&lt;/strong&gt;, an open-source declarative orchestration platform, to manage the end-to-end workflow of ingesting, transforming, and analyzing retail and economic data from public sources. This blog post will walk you through how Kestra is used in this project and why it is an excellent choice for workflow orchestration.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Kestra?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kestra is a modern orchestration platform designed to simplify the management of complex workflows. It offers several features that make it ideal for the InsightFlow project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declarative Workflow Design&lt;/strong&gt;: Workflows are defined in YAML, making them easy to read, version-control, and maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Kestra can handle large-scale workflows with hundreds of tasks, ensuring reliability even under heavy loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: With over 600 plugins, Kestra supports a wide range of tasks, including AWS services, database queries, and custom scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Kestra provides detailed logs, metrics, and monitoring tools to track workflow execution and troubleshoot issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with Modern Tools&lt;/strong&gt;: Kestra integrates seamlessly with Git, Terraform, and other tools, enabling a streamlined CI/CD pipeline.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Kestra in the InsightFlow Project&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the InsightFlow project, Kestra orchestrates the following key workflows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion&lt;/strong&gt;: Fetching raw data from public sources using AWS Batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transformation&lt;/strong&gt;: Running dbt models to clean, normalize, and structure the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Cataloging&lt;/strong&gt;: Updating the AWS Glue Data Catalog to reflect the latest data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing and Validation&lt;/strong&gt;: Running dbt tests to ensure data quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling and Automation&lt;/strong&gt;: Automating the entire pipeline to run on a daily schedule.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Workflow Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Kestra workflow for the production environment is defined in &lt;code&gt;kestra/flows/insightflow_prod_pipeline.yml&lt;/code&gt;. Below is an overview of the key tasks:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Data Ingestion via AWS Batch&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The workflow starts by submitting an AWS Batch job to ingest raw data from public sources into the S3 bucket &lt;code&gt;insightflow-prod-raw-data&lt;/code&gt;. This is achieved using the following task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;submit_batch_ingestion_job_cli&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.tasks.scripts.Bash&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;echo "Submitting AWS Batch Job..."&lt;/span&gt;
      &lt;span class="s"&gt;JOB_DEF_NAME="insightflow-prod-ingestion-job-def"&lt;/span&gt;
      &lt;span class="s"&gt;JOB_QUEUE_NAME="insightflow-prod-job-queue"&lt;/span&gt;
      &lt;span class="s"&gt;TARGET_BUCKET_NAME="insightflow-prod-raw-data"&lt;/span&gt;
      &lt;span class="s"&gt;AWS_REGION="ap-southeast-2"&lt;/span&gt;

      &lt;span class="s"&gt;JOB_NAME="insightflow-ingestion-{{execution.id}}"&lt;/span&gt;
      &lt;span class="s"&gt;JOB_OUTPUT=$(aws batch submit-job \\&lt;/span&gt;
        &lt;span class="s"&gt;--region "$AWS_REGION" \\&lt;/span&gt;
        &lt;span class="s"&gt;--job-name "$JOB_NAME" \\&lt;/span&gt;
        &lt;span class="s"&gt;--job-queue "$JOB_QUEUE_NAME" \\&lt;/span&gt;
        &lt;span class="s"&gt;--job-definition "$JOB_DEF_NAME" \\&lt;/span&gt;
        &lt;span class="s"&gt;--container-overrides '{&lt;/span&gt;
            &lt;span class="s"&gt;"environment": [&lt;/span&gt;
              &lt;span class="s"&gt;{"name": "TARGET_BUCKET", "value": "'"$TARGET_BUCKET_NAME"'"}&lt;/span&gt;
            &lt;span class="s"&gt;]&lt;/span&gt;
          &lt;span class="s"&gt;}')&lt;/span&gt;

      &lt;span class="s"&gt;JOB_ID=$(echo "$JOB_OUTPUT" | grep -o '"jobId": "[^"]*' | awk -F'"' '{print $4}')&lt;/span&gt;
      &lt;span class="s"&gt;echo "Submitted Job ID: $JOB_ID"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Updating the Glue Data Catalog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the raw data is ingested, the workflow triggers an AWS Glue Crawler to update the Glue Data Catalog. This ensures that the latest data is available for querying in Athena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;start_glue_crawler_cli&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.tasks.scripts.Bash&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;echo "Starting AWS Glue Crawler..."&lt;/span&gt;
      &lt;span class="s"&gt;CRAWLER_NAME="insightflow-prod-raw-data-crawler"&lt;/span&gt;
      &lt;span class="s"&gt;AWS_REGION="ap-southeast-2"&lt;/span&gt;

      &lt;span class="s"&gt;aws glue start-crawler --region $AWS_REGION --name "$CRAWLER_NAME"&lt;/span&gt;
      &lt;span class="s"&gt;echo "Crawler $CRAWLER_NAME started."&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Running dbt Models&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the data is cataloged, the workflow runs dbt models to transform the raw data into an analysis-ready format. This includes tasks for syncing dbt files, installing dependencies, and running the models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt_run&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.dbt.cli.DbtCLI&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;dbt run --target prod&lt;/span&gt;
    &lt;span class="na"&gt;namespaceFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;containerImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pizofreude/kestra-dbt-athena:latest&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4. Testing and Validation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure data quality, the workflow runs dbt tests on the transformed data. Any issues are logged for further investigation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt_test&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.dbt.cli.DbtCLI&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;dbt test --target prod&lt;/span&gt;
    &lt;span class="na"&gt;namespaceFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;containerImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pizofreude/kestra-dbt-athena:latest&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Scheduling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The workflow is scheduled to run daily at 5:00 AM UTC using Kestra's scheduling feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily_schedule&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.core.trigger.Schedule&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits of Using Kestra&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Kestra automates the entire pipeline, reducing manual intervention and ensuring consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: With built-in retry mechanisms and detailed logs, Kestra makes it easy to identify and resolve issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Kestra can handle large-scale workflows with multiple tasks and dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: The declarative YAML syntax allows for easy customization and extension of workflows.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Getting Started with Kestra&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To set up Kestra for your own projects, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install Kestra&lt;/strong&gt;: Refer to the &lt;a href="https://www.kestra.io/docs/" rel="noopener noreferrer"&gt;Kestra documentation&lt;/a&gt; for installation instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define Workflows&lt;/strong&gt;: Create YAML files to define your workflows, as shown in the examples above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run Workflows&lt;/strong&gt;: Use the Kestra UI or CLI to execute and monitor your workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrate with CI/CD&lt;/strong&gt;: Use Git and Terraform to version-control and deploy your workflows.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kestra is a powerful tool for orchestrating workflows in modern data pipelines. In the InsightFlow project, it plays a crucial role in automating the ingestion, transformation, and validation of retail and economic data. By leveraging Kestra's features, we ensure that the pipeline is reliable, scalable, and easy to maintain.&lt;/p&gt;

&lt;p&gt;If you're building a similar project, consider using Kestra to simplify your workflow orchestration. For more details, check out the &lt;a href="https://www.kestra.io/docs/" rel="noopener noreferrer"&gt;Kestra documentation&lt;/a&gt; or explore the &lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy orchestrating!&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>orchestration</category>
      <category>kestra</category>
    </item>
    <item>
      <title>InsightFlow Part 8: Setting Up AWS Athena for Data Analysis in InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 03:27:07 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-8-setting-up-aws-athena-for-data-analysis-in-insightflow-4aoo</link>
      <guid>https://dev.to/pizofreude/insightflow-part-8-setting-up-aws-athena-for-data-analysis-in-insightflow-4aoo</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how &lt;strong&gt;Amazon Athena&lt;/strong&gt; was set up for querying and analyzing data in the &lt;strong&gt;InsightFlow&lt;/strong&gt; project. Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It’s an essential component of the InsightFlow pipeline, enabling efficient querying of both raw and transformed data.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Amazon Athena?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Amazon Athena is an ideal choice for InsightFlow due to its:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Architecture&lt;/strong&gt;: No infrastructure to manage; you only pay for the queries you run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Integration with S3&lt;/strong&gt;: Queries data directly from S3 without requiring ETL processes to move data elsewhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for Open Formats&lt;/strong&gt;: Works with Parquet, ORC, JSON, and other formats, ensuring compatibility with the data pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning and Optimization&lt;/strong&gt;: Supports partitioning and compression to reduce query costs and improve performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For InsightFlow, Athena is used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query raw and processed data stored in S3.&lt;/li&gt;
&lt;li&gt;Analyze trends in retail sales and fuel prices.&lt;/li&gt;
&lt;li&gt;Serve as the backend for dashboards in AWS QuickSight.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Preparing the Data in S3&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The data pipeline stores both raw and transformed data in S3 buckets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raw Data&lt;/strong&gt;: Stored in the &lt;code&gt;insightflow-prod-raw-data&lt;/code&gt; bucket under the &lt;code&gt;raw/&lt;/code&gt; prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processed Data&lt;/strong&gt;: Stored in the &lt;code&gt;insightflow-prod-processed-data&lt;/code&gt; bucket under the &lt;code&gt;processed/&lt;/code&gt; prefix.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partitioning the Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To optimize query performance, the processed data is partitioned by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://insightflow-prod-processed-data/fct_retail_sales_monthly/year=2025/month=04/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitioning allows Athena to scan only the relevant data, reducing query costs and improving performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Setting Up the Glue Data Catalog&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Athena relies on the &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; to store metadata about the datasets. Glue Crawlers were used to automatically discover schemas and populate the Data Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Glue Crawler Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Glue Crawler scans the &lt;code&gt;processed&lt;/code&gt; S3 bucket and creates tables in the Glue database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_crawler"&lt;/span&gt; &lt;span class="s2"&gt;"processed_data_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow-prod-processed-data-crawler"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;glue_crawler_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;database_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_glue_catalog_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbt_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;s3_target&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://insightflow-prod-processed-data/processed/"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;schema_change_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;update_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"UPDATE_IN_DATABASE"&lt;/span&gt;
    &lt;span class="nx"&gt;delete_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"LOG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the crawler is run, the processed data is available as tables in the Glue Data Catalog.&lt;/p&gt;
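
&lt;p&gt;For example, the crawler can be re-run and the resulting catalog tables inspected with boto3; the Glue database name below is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

glue = boto3.client("glue", region_name="ap-southeast-2")

# Re-run the crawler after new processed data lands in S3
glue.start_crawler(Name="insightflow-prod-processed-data-crawler")

# List the tables the crawler registered (database name is hypothetical)
response = glue.get_tables(DatabaseName="insightflow_prod_db")
for table in response["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;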




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Configuring Athena&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Workgroup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Athena Workgroups help manage query costs and monitor usage. A workgroup was created for the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_athena_workgroup"&lt;/span&gt; &lt;span class="s2"&gt;"insightflow_workgroup"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow-prod-workgroup"&lt;/span&gt;

  &lt;span class="nx"&gt;configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enforce_workgroup_configuration&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;publish_cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;result_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;output_location&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://insightflow-prod-processed-data/athena-results/"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Setting Query Results Location&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Athena query results are stored in the &lt;code&gt;athena-results/&lt;/code&gt; prefix of the &lt;code&gt;processed&lt;/code&gt; S3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://insightflow-prod-processed-data/athena-results/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that query results are accessible for debugging and downstream processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Querying Data with Athena&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the Glue Crawler has populated the Data Catalog, the data can be queried using SQL in the Athena console or programmatically via the AWS CLI or SDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Query: Analyzing Retail Sales and Fuel Prices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The following query analyzes the correlation between retail sales and fuel prices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron95_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron97_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_diesel_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;fct_retail_sales_monthly&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;
    &lt;span class="n"&gt;fuelprice_monthly&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2025&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query joins the &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt; fact table with the &lt;code&gt;fuelprice_monthly&lt;/code&gt; table to analyze trends.&lt;/p&gt;
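
&lt;p&gt;As noted above, queries can also be submitted programmatically via the SDK. Below is a minimal boto3 sketch that runs a trimmed version of the query and prints the results; the database name is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")

query = """
SELECT year, month, sales_value_rm_mil
FROM fct_retail_sales_monthly
WHERE year = 2025
ORDER BY month
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "insightflow_prod_db"},  # hypothetical database name
    WorkGroup="insightflow-prod-workgroup",
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

# Print each returned row (the first row contains the column headers)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;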




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Optimizing Athena Queries&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Use Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Partitioning the data by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt; ensures that Athena scans only the relevant partitions, reducing query costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Use Parquet Format&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data is stored in Parquet format, which is optimized for analytical queries due to its columnar storage and compression.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Limit Data Scanned&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;SELECT&lt;/code&gt; statements to query only the required columns and apply filters (e.g., &lt;code&gt;WHERE year = 2025&lt;/code&gt;) to minimize the amount of data scanned.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Monitor Query Costs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Athena Workgroups provide metrics in CloudWatch to monitor query costs and performance.&lt;/p&gt;
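
&lt;p&gt;For example, the bytes scanned by the workgroup can be pulled from CloudWatch with boto3. This sketch assumes the workgroup publishes the standard &lt;code&gt;AWS/Athena&lt;/code&gt; &lt;code&gt;ProcessedBytes&lt;/code&gt; metric, as enabled in the Terraform configuration above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

# Daily bytes scanned by the workgroup over the last seven days
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Athena",
    MetricName="ProcessedBytes",
    Dimensions=[{"Name": "WorkGroup", "Value": "insightflow-prod-workgroup"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), round(point["Sum"] / 1e9, 2), "GB scanned")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;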




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 6: Integrating Athena with QuickSight&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Athena serves as the backend for dashboards in AWS QuickSight. QuickSight connects to Athena using the Glue Data Catalog, enabling interactive visualizations of retail sales and fuel price trends.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges and Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Managing schema changes in Glue required careful configuration of the &lt;code&gt;schema_change_policy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning Strategy&lt;/strong&gt;: Choosing the right partitioning strategy was critical for optimizing query performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Management&lt;/strong&gt;: Monitoring query costs in Athena Workgroups helped identify and optimize expensive queries.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Amazon Athena is a powerful tool for querying and analyzing data directly in S3. By integrating Athena with the Glue Data Catalog and optimizing the data layout, InsightFlow enables efficient, cost-effective data analysis.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>InsightFlow Part 7: Data Quality Implementation &amp; Best Practices for InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 03:04:01 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-7-data-quality-implementation-best-practices-for-insightflow-27cp</link>
      <guid>https://dev.to/pizofreude/insightflow-part-7-data-quality-implementation-best-practices-for-insightflow-27cp</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how &lt;strong&gt;data quality&lt;/strong&gt; was implemented in the InsightFlow project and share best practices for ensuring reliable and accurate data pipelines. Data quality is a critical aspect of any data engineering project, as it ensures that the insights derived from the data are trustworthy and actionable.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Data Quality Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data quality directly impacts the reliability of analytics and decision-making. Poor data quality can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inaccurate Insights&lt;/strong&gt;: Misleading trends and correlations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Inefficiencies&lt;/strong&gt;: Wasted time debugging and fixing issues downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss of Trust&lt;/strong&gt;: Stakeholders losing confidence in the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For InsightFlow, ensuring data quality was essential to accurately analyze retail sales trends and their correlation with fuel prices.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Quality Framework for InsightFlow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The data quality framework for InsightFlow was implemented at multiple stages of the pipeline, from ingestion to transformation and analysis. Below are the key components:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Data Validation During Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The ingestion layer, implemented using &lt;strong&gt;AWS Batch&lt;/strong&gt;, includes basic validation checks to ensure the raw data meets expected formats and structures.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Validation Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File Format Validation&lt;/strong&gt;: Ensures that ingested files are in the expected format (e.g., Parquet or CSV).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Validation&lt;/strong&gt;: Confirms that the files contain the required columns with the correct data types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Null Checks&lt;/strong&gt;: Flags missing or null values in critical columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example: Python Validation Script&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;missing_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;required_columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing_columns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Null values detected in the dataset.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
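
&lt;p&gt;For instance, the ingestion job might call this check on a freshly downloaded file like so; the file path and column names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Hypothetical raw file and required columns for the fuel price dataset
df = pd.read_parquet("raw/fuelprice_2025_04.parquet")
validate_data(df, required_columns=["date", "ron95", "ron97", "diesel"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;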






&lt;h3&gt;
  
  
  &lt;strong&gt;2. Data Quality in Transformation (dbt)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The transformation layer, implemented using &lt;strong&gt;dbt&lt;/strong&gt;, includes robust data quality checks through &lt;strong&gt;schema tests&lt;/strong&gt; and &lt;strong&gt;custom tests&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Schema tests ensure that the data adheres to predefined rules. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not Null&lt;/strong&gt;: Ensures critical columns (e.g., &lt;code&gt;sales_value_rm_mil&lt;/code&gt;) are not null.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique&lt;/strong&gt;: Ensures unique values in primary key columns (e.g., &lt;code&gt;date_key&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships&lt;/strong&gt;: Validates foreign key relationships between fact and dimension tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example: Schema Test for &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt;&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fct_retail_sales_monthly&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fact&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;combining&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;retail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;average&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fuel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prices."&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;date_key&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Foreign&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dimension."&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;relationships&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ref('dim_date')&lt;/span&gt;
              &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;date_key&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sales_value_rm_mil&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;RM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;millions."&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;dbt_utils.expression_is_true&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Custom Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Custom tests were implemented using the &lt;code&gt;dbt-utils&lt;/code&gt; package to validate business-specific rules. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price Range Validation&lt;/strong&gt;: Ensures fuel prices are within a reasonable range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume Index Validation&lt;/strong&gt;: Ensures volume indices are non-negative.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Monitoring and Alerts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure ongoing data quality, monitoring and alerting mechanisms were implemented using &lt;strong&gt;CloudWatch&lt;/strong&gt; and &lt;strong&gt;Kestra&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;CloudWatch Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Glue Crawler Logs&lt;/strong&gt;: Monitors schema changes and ingestion errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Athena Query Logs&lt;/strong&gt;: Tracks query performance and errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Kestra Workflow Alerts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kestra workflows include error handling and notifications for failed tasks. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a Glue Crawler fails, an alert is sent to the team via email or Slack.&lt;/li&gt;
&lt;li&gt;If a dbt test fails, the pipeline halts, and the issue is logged for debugging.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices for Data Quality&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Define Clear Data Quality Rules&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Collaborate with stakeholders to define rules for each dataset (e.g., required columns, valid ranges).&lt;/li&gt;
&lt;li&gt;Document these rules in a central repository for easy reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Automate Data Quality Checks&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use tools like dbt to automate schema and custom tests.&lt;/li&gt;
&lt;li&gt;Integrate validation scripts into the ingestion pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Monitor Data Quality Continuously&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Set up dashboards to monitor key metrics (e.g., null values, schema changes).&lt;/li&gt;
&lt;li&gt;Use alerts to notify the team of issues in real time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Handle Data Quality Issues Proactively&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implement retry mechanisms for transient errors (e.g., network issues during ingestion).&lt;/li&gt;
&lt;li&gt;Log all data quality issues for auditing and debugging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Test Data Quality Regularly&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Schedule regular tests to ensure data quality rules are enforced.&lt;/li&gt;
&lt;li&gt;Use historical data to validate new rules and identify anomalies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges and Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Managing schema changes in Glue required careful configuration of the &lt;code&gt;schema_change_policy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Tests&lt;/strong&gt;: Writing custom tests for business-specific rules required collaboration with domain experts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert Fatigue&lt;/strong&gt;: Fine-tuning alerts was necessary to avoid overwhelming the team with non-critical notifications.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Implementing robust data quality practices is essential for building reliable data pipelines. By integrating validation checks, schema tests, and monitoring mechanisms, InsightFlow ensures that its data is accurate, consistent, and trustworthy. These practices not only improve the quality of insights but also build confidence among stakeholders.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>dbt</category>
    </item>
    <item>
      <title>InsightFlow Part 6: Implementing ETL Processes with AWS Glue for InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 02:44:42 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-6-implementing-etl-processes-with-aws-glue-for-insightflow-1gaj</link>
      <guid>https://dev.to/pizofreude/insightflow-part-6-implementing-etl-processes-with-aws-glue-for-insightflow-1gaj</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how &lt;strong&gt;AWS Glue&lt;/strong&gt; was used to implement the &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; processes for the &lt;strong&gt;InsightFlow&lt;/strong&gt; project. AWS Glue provides a serverless, fully managed environment for building and running ETL pipelines, making it an ideal choice for transforming raw data into a structured, queryable format.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why AWS Glue?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AWS Glue simplifies the process of building ETL pipelines by offering the following key features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Architecture&lt;/strong&gt;: No need to manage infrastructure; Glue automatically provisions resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Discovery&lt;/strong&gt;: Automatically detects and catalogs data schemas using the &lt;strong&gt;Glue Data Catalog&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with AWS Services&lt;/strong&gt;: Seamlessly integrates with S3, Athena, and other AWS services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Automatically scales to handle large datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Pay only for the resources used during ETL jobs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For InsightFlow, AWS Glue was used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover and catalog raw data stored in S3.&lt;/li&gt;
&lt;li&gt;Transform raw data into a structured format.&lt;/li&gt;
&lt;li&gt;Load the transformed data into a partitioned data warehouse layer in S3.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Overview of the ETL Workflow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The ETL process in InsightFlow involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt;: Fetch raw data from S3 buckets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: Clean, normalize, and enrich the data using Glue jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Write the transformed data back to S3 in a partitioned format for efficient querying with Athena.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Components&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Glue Data Catalog&lt;/strong&gt;: Stores metadata about the raw and transformed datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Crawlers&lt;/strong&gt;: Automatically discover schemas and update the Data Catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Jobs&lt;/strong&gt;: Perform the actual data transformations.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Setting Up the Glue Data Catalog&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Glue Data Catalog&lt;/strong&gt; acts as a central repository for metadata about the datasets. It enables Athena to query the data without requiring explicit schema definitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Defining the Glue Database&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Glue database was created to organize the tables for the project. Here’s the Terraform configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_catalog_database"&lt;/span&gt; &lt;span class="s2"&gt;"dbt_database"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow_prod"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Discovering Data with Glue Crawlers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Glue Crawlers were used to automatically discover the schema of raw data stored in S3 and populate the Data Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring the Glue Crawler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The crawler scans the &lt;code&gt;raw&lt;/code&gt; S3 bucket and creates tables in the Glue database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_crawler"&lt;/span&gt; &lt;span class="s2"&gt;"raw_data_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow-prod-raw-data-crawler"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;glue_crawler_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;database_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_glue_catalog_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbt_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;s3_target&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://insightflow-prod-raw-data/raw/"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;schema_change_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;update_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"UPDATE_IN_DATABASE"&lt;/span&gt;
    &lt;span class="nx"&gt;delete_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"LOG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Running the Crawler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The crawler is triggered using the AWS CLI or programmatically via the AWS SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws glue start-crawler &lt;span class="nt"&gt;--name&lt;/span&gt; insightflow-prod-raw-data-crawler

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
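
&lt;p&gt;For the programmatic route, here's a minimal boto3 sketch (assuming credentials and region are already configured):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

glue = boto3.client("glue")

# Start the crawler, then poll until it returns to the READY state.
glue.start_crawler(Name="insightflow-prod-raw-data-crawler")

while True:
    state = glue.get_crawler(Name="insightflow-prod-raw-data-crawler")["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;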



&lt;p&gt;Once the crawler completes, the raw data is available as tables in the Glue Data Catalog.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Transforming Data with Glue Jobs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Glue jobs were used to clean, normalize, and enrich the raw data. These jobs are written in &lt;strong&gt;PySpark&lt;/strong&gt;, allowing for scalable, distributed data processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Glue Job: Aggregating Fuel Prices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The following Glue job aggregates weekly fuel prices into monthly averages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark session
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read raw fuel price data from S3
&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://insightflow-prod-raw-data/raw/fuelprice/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Aggregate weekly prices to monthly averages
&lt;/span&gt;&lt;span class="n"&gt;monthly_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ymd_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ymd_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \\
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ron95&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_ron95_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ron97&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_ron97_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diesel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_diesel_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write the transformed data back to S3
&lt;/span&gt;&lt;span class="n"&gt;monthly_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://insightflow-prod-processed-data/fuelprice_monthly/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Scheduling Glue Jobs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Glue jobs can be scheduled to run at regular intervals using &lt;strong&gt;Glue Triggers&lt;/strong&gt; or external orchestration tools like &lt;strong&gt;Kestra&lt;/strong&gt;.&lt;/p&gt;
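
&lt;p&gt;For the Glue Trigger option, here's a hedged boto3 sketch; the trigger and job names below are illustrative rather than the project's actual resources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

glue = boto3.client("glue")

# Create a scheduled trigger that runs a transformation job daily at 02:00 UTC.
glue.create_trigger(
    Name="insightflow-fuelprice-monthly-trigger",                # illustrative name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "insightflow-fuelprice-monthly-agg"}],  # illustrative job name
    StartOnCreation=True,
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;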




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Querying Transformed Data with Athena&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The transformed data is stored in a partitioned format in the &lt;code&gt;processed&lt;/code&gt; S3 bucket. Athena queries can leverage these partitions for efficient data retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s an example query to analyze the correlation between retail sales and fuel prices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron95_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron97_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_diesel_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;fct_retail_sales_monthly&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;
    &lt;span class="n"&gt;fuelprice_monthly&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2025&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
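
&lt;p&gt;To run queries like this programmatically rather than in the console, here's a minimal boto3 Athena sketch (the results bucket is an assumed placeholder, not an actual project bucket):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM fct_retail_sales_monthly WHERE year = 2025",
    QueryExecutionContext={"Database": "insightflow_prod"},
    ResultConfiguration={"OutputLocation": "s3://insightflow-prod-athena-results/"},  # assumed bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;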






&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges and Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Managing schema changes in Glue required careful configuration of the &lt;code&gt;schema_change_policy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt;: Proper partitioning of the transformed data significantly improved query performance in Athena.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Permissions&lt;/strong&gt;: Ensuring the Glue job role had the necessary permissions to access S3 and the Data Catalog was critical.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>etl</category>
      <category>aws</category>
    </item>
    <item>
      <title>InsightFlow Part 5: Designing the Data Model &amp; Schema with dbt for InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 02:30:49 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-5-designing-the-data-model-schema-with-dbt-for-insightflow-3fb6</link>
      <guid>https://dev.to/pizofreude/insightflow-part-5-designing-the-data-model-schema-with-dbt-for-insightflow-3fb6</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive into how the &lt;strong&gt;data model and schema&lt;/strong&gt; for the InsightFlow project were designed using &lt;strong&gt;dbt (Data Build Tool)&lt;/strong&gt;. This layer is critical for transforming raw data into a structured, analysis-ready format that supports efficient querying and visualization. We’ll also explore the &lt;strong&gt;Entity-Relationship Diagram (ERD)&lt;/strong&gt; for the project, which provides a visual representation of the relationships between the key entities in the data model.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why dbt for Data Modeling?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;dbt is a powerful tool for transforming raw data into a structured format using SQL. It enables data engineers to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standardize Transformations&lt;/strong&gt;: Define reusable SQL models for data cleaning, normalization, and enrichment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Control&lt;/strong&gt;: Manage transformations as code in Git for collaboration and reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Data Quality&lt;/strong&gt;: Add tests to ensure data integrity at every stage of the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize for Querying&lt;/strong&gt;: Materialize models as views or tables, partitioned and optimized for querying in AWS Athena.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For InsightFlow, dbt was the ideal choice to transform raw retail and fuel price data into a &lt;strong&gt;star schema&lt;/strong&gt; that supports analysis of trends and correlations.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Overview of the Data Model&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The InsightFlow data model is designed as a &lt;strong&gt;star schema&lt;/strong&gt;, with the following key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fact Table&lt;/strong&gt;: Contains quantitative metrics, such as sales values and fuel prices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension Tables&lt;/strong&gt;: Provide descriptive attributes, such as MSIC group codes and dates, to slice and dice the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Tables&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact Table&lt;/strong&gt;: &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: Sales values, volume indices, fuel prices.&lt;/li&gt;
&lt;li&gt;Partitioned by: Year and month for efficient querying in Athena.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Dimension Tables&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dim_msic_lookup&lt;/code&gt;: Provides descriptions for MSIC group codes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_date&lt;/code&gt;: A date dimension table for time-based analysis.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Defining Sources&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The raw data ingested into the &lt;strong&gt;landing zone&lt;/strong&gt; (S3 bucket) is defined as &lt;strong&gt;sources&lt;/strong&gt; in dbt. The underlying tables are created by the Glue Crawler and include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;iowrt&lt;/strong&gt;: Headline wholesale and retail trade data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iowrt_3d&lt;/strong&gt;: Detailed wholesale and retail trade data by MSIC group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fuelprice&lt;/strong&gt;: Weekly fuel price data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how the sources are defined in sources.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;landing_zone&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;insightflow_prod&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;loaded&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data.gov.my&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AWS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Batch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ingestion."&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iowrt&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Headline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Wholesale&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Retail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Trade&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(monthly)."&lt;/span&gt;
        &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;series&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Series&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;('abs',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'growth_yoy',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'growth_mom')"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ymd_date&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(YYYY-MM-DD,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;monthly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;frequency)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sales&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mil)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Volume&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Index&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(base&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2015&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;100)"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fuelprice&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;weekly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fuel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data."&lt;/span&gt;
        &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ron95&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RON95&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM/litre)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ron97&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RON97&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM/litre)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;diesel&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Diesel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM/litre)"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Creating Staging Models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Staging models clean and standardize the raw data. For example, the stg_iowrt.sql model filters for absolute values and casts columns to appropriate data types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;source_data&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ymd_date&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;record_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;volume&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'landing_zone'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'iowrt'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abs'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;record_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volume_index&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;source_data&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Building the Fact Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The fact table, &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt;, combines data from multiple sources (e.g., retail sales and fuel prices) into a single table. It is partitioned by year and month for efficient querying in Athena.&lt;/p&gt;

&lt;p&gt;Here’s the configuration in dbt_project.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fct_retail_sales_monthly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;+materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt;
  &lt;span class="na"&gt;+partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;year&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;month&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Adding Dimension Tables&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. MSIC Lookup Dimension&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;dim_msic_lookup&lt;/code&gt; table provides descriptions for MSIC group codes. It is created from a seed file (&lt;code&gt;msic_lookup.csv&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;seeds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;insightflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;msic_lookup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw_seeds&lt;/span&gt;
      &lt;span class="na"&gt;+file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parquet&lt;/span&gt;
      &lt;span class="na"&gt;+column_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;group_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;varchar&lt;/span&gt;
        &lt;span class="na"&gt;desc_en&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;varchar&lt;/span&gt;
        &lt;span class="na"&gt;desc_bm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;varchar&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Date Dimension&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;dim_date&lt;/code&gt; table is a standard date dimension table that supports time-based analysis.&lt;/p&gt;
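
&lt;p&gt;The project builds &lt;code&gt;dim_date&lt;/code&gt; as a dbt model; purely to illustrate the columns such a table typically carries, here's a small pandas sketch (the date range and column set are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# One row per day over an assumed range that covers the datasets in this project.
dates = pd.date_range("2015-01-01", "2025-12-31", freq="D")

dim_date = pd.DataFrame({
    "date_key": dates,
    "year": dates.year,
    "quarter": dates.quarter,
    "month": dates.month,
    "month_name": dates.month_name(),
    "day_of_week": dates.day_name(),
    "is_weekend": dates.dayofweek &gt;= 5,
})

print(dim_date.head())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;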




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Testing and Documentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;dbt allows you to add tests to ensure data quality. For example, you can require that a column is never null and that the raw &lt;code&gt;series&lt;/code&gt; column only contains accepted values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;abs&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;growth_yoy&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;growth_mom&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, dbt automatically generates documentation for your models (via &lt;code&gt;dbt docs generate&lt;/code&gt;), which can be served locally with &lt;code&gt;dbt docs serve&lt;/code&gt; and viewed in a browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Entity-Relationship Diagram (ERD)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the ERD for the InsightFlow data model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshti21z9mgo8jufo81hu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshti21z9mgo8jufo81hu.png" alt="The Entity Relational Diagram (ERD) for InsightFlow" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also explore the diagram interactively on &lt;a href="https://dbdiagram.io/d/681038261ca52373f5bce4c1" rel="noopener noreferrer"&gt;dbdiagram.io&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;By leveraging dbt, we transformed raw data into a structured, analysis-ready format. The star schema design ensures efficient querying and supports a wide range of analyses, from sales trends to fuel price correlations.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>datastructures</category>
    </item>
  </channel>
</rss>
