Introduction
Welcome to the first part of Peer Review 1 for DTC DEZOOMCAMP. This two-part series provides an in-depth review of a data engineering pipeline designed to analyze Poland's real estate market. The project demonstrates the use of modern data engineering tools such as BigQuery, dbt Cloud, and Kestra, along with a Streamlit dashboard for visualization.
This post will focus on the problem description, data ingestion pipeline, and the cloud setup, while the next post will explore the interactive dashboard and insights.
Problem Description
The project aims to analyze Poland's real estate market, focusing on rental and sales trends across various cities. By processing and visualizing the data, the following questions are addressed:
- Which cities have the highest rental or sales activity?
- What are the price trends across different cities?
- How does the real estate market vary between rentals and sales?
A dataset from Kaggle, containing apartment prices in Poland, serves as the starting point. This dataset includes details such as city names, transaction types (rent/sale), and prices. The primary challenge lies in transforming the raw CSV data into actionable insights while ensuring scalability and reproducibility.
Data Ingestion: Batch Processing with Kestra
Workflow Orchestration
The project employs Kestra for handling multiple CSV files and automating the ETL process. The workflow includes:
- Data Extraction: CSV files containing raw real estate data are ingested into the pipeline.
- Data Transformation: Kestra facilitates cleaning and structuring the data for analysis.
- Data Loading: The cleaned data is loaded into both PostgreSQL (for local analysis) and BigQuery (for cloud-based analysis).
Why Kestra?
Kestra automates the entire ETL process, ensuring consistency and minimizing manual intervention. Although the dataset isn't updated regularly, the pipeline is scalable and can handle new data efficiently.
Example Kestra Flow
An example Kestra flow processes the CSV files by:
- Taking file paths and metadata (e.g., month and year) as input.
- Executing tasks for data cleaning, validation, and loading.
- Producing cleaned data as output in BigQuery and PostgreSQL.
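The post describes this flow only at a high level, so here is a minimal sketch of what such a flow could look like. It is not the author's actual flow: the ids, connection details, and table names are placeholders, and the plugin properties may need adjusting to the installed Kestra and plugin versions.

```yaml
id: ingest_polish_flats   # hypothetical flow id
namespace: realestate     # hypothetical namespace

inputs:
  - id: csv_url
    type: STRING
  - id: month
    type: STRING
  - id: year
    type: STRING

tasks:
  # Extraction: download the raw CSV for the given month/year
  - id: download_csv
    type: io.kestra.plugin.core.http.Download
    uri: "{{ inputs.csv_url }}"

  # Loading for local analysis: copy the file into PostgreSQL
  - id: load_postgres
    type: io.kestra.plugin.jdbc.postgresql.CopyIn
    url: jdbc:postgresql://postgres:5432/realestate
    username: "{{ secret('POSTGRES_USER') }}"
    password: "{{ secret('POSTGRES_PASSWORD') }}"
    table: raw_apartments
    format: CSV
    header: true
    from: "{{ outputs.download_csv.uri }}"

  # Loading for cloud analysis: push the same file into BigQuery
  - id: load_bigquery
    type: io.kestra.plugin.gcp.bigquery.Load
    from: "{{ outputs.download_csv.uri }}"
    destinationTable: "my-gcp-project.polish_flats.raw_apartments"
    format: CSV
    csvOptions:
      skipLeadingRows: 1
```

The cleaning and validation step would sit between the download and the two load tasks, which is also where the month and year inputs come into play.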
Cloud Setup: BigQuery and dbt Cloud
BigQuery as the Data Warehouse
BigQuery serves as the data warehouse for storing and querying the transformed data. Its serverless architecture and scalability make it an excellent choice. Key features utilized include:
- SQL Queries: Used to analyze price distributions, trends, and city-level activity.
- Integration with dbt Cloud: Enables modular and reusable transformations.
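To make the kind of analysis concrete, a query along these lines would surface city-level activity and monthly price trends. The table and column names (apartments, city, type, price, listing_date) are assumptions based on the dataset description, not the project's actual schema.

```sql
-- Listing volume and average price per city, transaction type, and month
SELECT
  city,
  type AS transaction_type,
  DATE_TRUNC(listing_date, MONTH) AS month,
  COUNT(*) AS listings,
  ROUND(AVG(price), 0) AS avg_price
FROM `my-gcp-project.polish_flats.apartments`
GROUP BY city, transaction_type, month
ORDER BY month, listings DESC;
```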
Transformations with dbt Cloud
dbt Cloud is employed for data cleaning and structuring. It allows:
- Writing modular SQL models.
- Testing data integrity.
- Creating curated tables with calculated fields like medians, percentiles, and trends.
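As a sketch of what such a curated model could look like (the staging model name and columns are assumptions, not the project's actual code), a dbt model computing per-city medians and percentiles on BigQuery might be:

```sql
-- models/marts/city_price_stats.sql (hypothetical model name)
SELECT
  city,
  type AS transaction_type,
  COUNT(*) AS listings,
  APPROX_QUANTILES(price, 100)[OFFSET(50)] AS median_price,
  APPROX_QUANTILES(price, 100)[OFFSET(90)] AS p90_price
FROM {{ ref('stg_apartments') }}
GROUP BY city, transaction_type
```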
Example dbt Configuration
Below is a snippet from the dbt_project.yml file:

    name: 'polish_flats_dbt'
    version: '1.0'
    config-version: 2
    profile: 'default' # Use the default profile from profiles.yml
    model-paths:
      - models
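The data-integrity tests mentioned above would typically live in a schema.yml next to the models. A minimal sketch, reusing the hypothetical model from earlier:

```yaml
# models/schema.yml (illustrative; model and column names are assumptions)
version: 2

models:
  - name: city_price_stats
    columns:
      - name: city
        tests:
          - not_null
      - name: median_price
        tests:
          - not_null
```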
Challenges and Workarounds
- Challenge: Streamlit occasionally failed due to sync delays from the US cluster of dbt Cloud.
- Workaround: Pre-exported CSVs were used for local analysis, significantly improving performance and reliability.
Reproducibility
The README file provides detailed instructions for setting up the project locally. These include:
- Setting up PostgreSQL and Kestra using Docker.
- Installing dependencies for dbt and running transformations.
- Configuring BigQuery and dbt Cloud for seamless integration.
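Because the dbt_project.yml shown earlier points at a 'default' profile, running dbt-bigquery locally also needs a matching profiles.yml. A sketch with placeholder values (dbt Cloud manages its own connection, so this only matters for local runs):

```yaml
# ~/.dbt/profiles.yml -- placeholder values, not the project's real settings
default:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      project: my-gcp-project      # GCP project ID
      dataset: polish_flats        # target BigQuery dataset
      keyfile: /path/to/service-account.json
      threads: 4
      location: EU
```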
Running Locally
Follow these steps to run the pipeline locally:
- Clone the repository:

      git clone https://github.com/elgrassa/Data-engineering-professional-certificate.git
      cd Data-engineering-professional-certificate

- Start PostgreSQL and Kestra using Docker:

      docker-compose -p kestra-postgres up -d

- Install dependencies:

      pip install -r requirements.txt
      pip install dbt-bigquery
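From there, assuming the BigQuery profile above is in place and the dbt project lives in a subdirectory of the repository, the transformations can be kicked off with the standard dbt commands:

```bash
cd polish_flats_dbt   # hypothetical path to the dbt project inside the repo
dbt debug             # verify the connection to BigQuery
dbt run               # build the models
dbt test              # run the data integrity tests
```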
Conclusion
This post reviewed the problem description, batch data ingestion pipeline with Kestra, and the cloud setup using BigQuery and dbt Cloud. These components form the backbone of the project, enabling efficient ETL processes and scalable storage.
The next post will delve into the Streamlit dashboard, visualizations, and insights derived from the data.