<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pizofreude</title>
    <description>The latest articles on DEV Community by Pizofreude (@pizofreude).</description>
    <link>https://dev.to/pizofreude</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F971308%2F0a84c6c6-2fcf-470a-9050-d90c267dfccb.png</url>
      <title>DEV Community: Pizofreude</title>
      <link>https://dev.to/pizofreude</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pizofreude"/>
    <language>en</language>
    <item>
      <title>dlt MCP Server for Popular IDEs</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Wed, 18 Feb 2026 14:42:17 +0000</pubDate>
      <link>https://dev.to/pizofreude/dlt-mcp-server-for-popular-ides-2e9b</link>
      <guid>https://dev.to/pizofreude/dlt-mcp-server-for-popular-ides-2e9b</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This demo showcases how to set up and use the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; for data pipeline validation and inspection. The MCP server enables interactive querying and management of dlt pipelines, including data inspection, row counts, and load validation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UV&lt;/strong&gt; installed on your local machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dlt workspace&lt;/strong&gt; installed and configured.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;pyproject.toml&lt;/code&gt; file with the necessary dependencies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Configure MCP Server
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;VS Code&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;VS Code&lt;/strong&gt; and access &lt;strong&gt;Settings&lt;/strong&gt; (&lt;code&gt;Command+Shift+P&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Tools &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Ensure you include the &lt;code&gt;duckdb&lt;/code&gt; dependency if using a DuckDB destination.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;
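&lt;p&gt;For reference, a complete &lt;code&gt;mcp.json&lt;/code&gt; usually nests each server entry under a top-level key. Below is a minimal sketch, assuming VS Code's &lt;code&gt;servers&lt;/code&gt; wrapper (other clients typically use &lt;code&gt;mcpServers&lt;/code&gt;) and keeping the placeholder command and the &lt;code&gt;duckdb&lt;/code&gt; argument from the snippet above; adapt both to your own setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "servers": {
    "dlt-mcp-server": {
      "command": "dlt pipeline ...",
      "args": ["--with", "duckdb"]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;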

&lt;h4&gt;
  
  
  &lt;strong&gt;Cursor&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Cursor&lt;/strong&gt; and access &lt;strong&gt;Settings&lt;/strong&gt; (&lt;code&gt;Command+,&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Extensions &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Kiro&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Kiro&lt;/strong&gt; and access &lt;strong&gt;Preferences&lt;/strong&gt; (&lt;code&gt;Command+,&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Plugins &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Claude Desktop&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;strong&gt;Claude Desktop&lt;/strong&gt; and access &lt;strong&gt;Settings&lt;/strong&gt; (&lt;code&gt;Command+,&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Integrations &amp;gt; MCP&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Add Custom MCP&lt;/strong&gt; to create/open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Other IDEs (e.g., PyCharm, IntelliJ, Sublime Text)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Locate the &lt;strong&gt;MCP configuration&lt;/strong&gt; section in your IDE's settings.&lt;/li&gt;
&lt;li&gt;Create or open the &lt;code&gt;mcp.json&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;Add the configuration for the &lt;strong&gt;dlt MCP Server&lt;/strong&gt; to &lt;code&gt;mcp.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dlt pipeline ..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--with"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duckdb"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save the file. The MCP server will automatically update within a few seconds.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Test MCP Server
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Open a chat in your IDE and ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  What pipelines are available?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The MCP server should list the available pipelines (e.g., GitHub pipeline).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Inspect Pipeline Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  What tables are in this pipeline?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The server will list tables (e.g., &lt;code&gt;commits&lt;/code&gt;, &lt;code&gt;contributors&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  When was the data last loaded?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The server will provide the timestamp of the last data load.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Validate Data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  How many rows are in the commits table?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If the MCP server lacks dependencies (e.g., &lt;code&gt;duckdb&lt;/code&gt;), it will throw an error. Update the &lt;code&gt;mcp.json&lt;/code&gt; configuration to include the missing dependency and retry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Agentic Help
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ask:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  How many rows will be extracted in the next run in commits?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The MCP server will analyze the pipeline and confirm whether incremental loading is applied. If it is not, the next run will re-extract all existing rows plus any new data since the last run.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependency Errors&lt;/strong&gt;: Ensure all required dependencies (e.g., &lt;code&gt;duckdb&lt;/code&gt;) are included in the &lt;code&gt;mcp.json&lt;/code&gt; configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Updates&lt;/strong&gt;: After modifying &lt;code&gt;mcp.json&lt;/code&gt;, wait a few seconds for the MCP server to apply changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE-Specific Issues&lt;/strong&gt;: Refer to your IDE's documentation for MCP-related troubleshooting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;dlt MCP Server&lt;/strong&gt; simplifies pipeline management by enabling interactive data inspection and validation. Customize the &lt;code&gt;mcp.json&lt;/code&gt; configuration to support your specific pipeline destinations and dependencies.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>vscode</category>
      <category>cursor</category>
    </item>
    <item>
      <title>Peer Review 3: France Data Engineering Job Market Transformations, Visualization, and Feedback (Part 2)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Fri, 02 May 2025 13:42:31 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-transformations-visualization-and-feedback-3ahl</link>
      <guid>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-transformations-visualization-and-feedback-3ahl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome back to the final part of the peer review of the &lt;a href="https://github.com/aafaf655/DE-Job-Market-Analysis" rel="noopener noreferrer"&gt;France Data Engineering Job Market Analysis pipeline&lt;/a&gt;. In &lt;a href="https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-analysis-pipeline-infra-part-1-2ei1"&gt;Part 1&lt;/a&gt;, we explored the project’s infrastructure, cloud setup, and orchestration. Now, we’ll go deeper into the heart of the data platform: transformations, data warehouse design, dashboarding, reproducibility, and actionable feedback.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Transformations with dbt
&lt;/h2&gt;

&lt;p&gt;Modern data engineering pipelines are built on modular, testable transformations—and dbt (Data Build Tool) shines in this space. This project structures its dbt codebase into &lt;strong&gt;staging&lt;/strong&gt;, &lt;strong&gt;core&lt;/strong&gt;, and &lt;strong&gt;marts&lt;/strong&gt; layers, following best practices for maintainability and scalability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Staging Models:&lt;/strong&gt; Clean and standardize raw job posting data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Models:&lt;/strong&gt; Build core analytical tables, e.g., &lt;code&gt;fact_jobs&lt;/code&gt;, &lt;code&gt;dim_company&lt;/code&gt;, &lt;code&gt;dim_skills&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Marts Models:&lt;/strong&gt; Deliver analytics-ready tables for direct dashboard consumption, e.g., top skills, salary distribution, remote job trends.&lt;/li&gt;
&lt;/ul&gt;
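&lt;p&gt;To make the layering concrete, here is a minimal sketch of what a staging model in this structure might look like. The source and column names are hypothetical and not taken from the project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- models/staging/stg_job_postings.sql (hypothetical names)
select
    cast(job_id as string)      as job_id,
    lower(trim(job_title))      as job_title,
    lower(trim(company_name))   as company_name,
    cast(posted_date as date)   as posted_date
from {{ source('raw', 'job_postings') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;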

&lt;p&gt;&lt;strong&gt;Integration:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;dbt transformations are automated via Kestra, ensuring that new data is transformed and ready for analytics on a regular schedule.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent use of dbt’s modular structure. The pipeline ensures all transformations are reproducible, testable, and production-ready. For further robustness, consider populating the &lt;code&gt;tests/&lt;/code&gt; and &lt;code&gt;macros/&lt;/code&gt; folders with custom tests and logic.&lt;/p&gt;
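&lt;p&gt;As an illustration of what could live in &lt;code&gt;tests/&lt;/code&gt;, a dbt singular test is simply a SQL file that must return zero rows to pass. A minimal sketch, using the project's &lt;code&gt;fact_jobs&lt;/code&gt; table and a hypothetical &lt;code&gt;job_id&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- tests/assert_fact_jobs_has_no_null_ids.sql (hypothetical)
-- The test fails if this query returns any rows.
select *
from {{ ref('fact_jobs') }}
where job_id is null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;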




&lt;h2&gt;
  
  
  2. Data Warehouse Design
&lt;/h2&gt;

&lt;p&gt;The project leverages &lt;strong&gt;Google BigQuery&lt;/strong&gt; as the data warehouse, which is a solid choice for scalable analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External Tables:&lt;/strong&gt; Raw CSVs in GCS are registered as external tables in BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native Tables &amp;amp; Marts:&lt;/strong&gt; Transformed data is materialized as native tables and views for efficient querying.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Partitioning &amp;amp; Clustering:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While the project’s structure suggests a thoughtful separation between staging and marts, there isn’t explicit documentation of table partitioning or clustering strategies. These can make a big difference in query performance and cost efficiency at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Good warehouse design with clear separation of concerns. Adding documentation and rationale for partitioning and clustering would further strengthen the warehouse layer.&lt;/p&gt;
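&lt;p&gt;For example, with dbt on BigQuery, partitioning and clustering can be declared directly in a model's config block. A minimal sketch with hypothetical column names, referencing the staging model sketched earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;{{ config(
    materialized='table',
    partition_by={'field': 'posted_date', 'data_type': 'date'},
    cluster_by=['company_name', 'location']
) }}

select * from {{ ref('stg_job_postings') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;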




&lt;h2&gt;
  
  
  3. Dashboarding &amp;amp; Data Products
&lt;/h2&gt;

&lt;p&gt;The final data product is a &lt;strong&gt;Power BI dashboard&lt;/strong&gt; that visualizes key insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tiles include:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Top skills in demand&lt;/li&gt;
&lt;li&gt;Salary distribution&lt;/li&gt;
&lt;li&gt;Remote work trends&lt;/li&gt;
&lt;li&gt;Company performance and job trends over time&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The dashboard is visually clear (see screenshots in the &lt;code&gt;images/&lt;/code&gt; folder) and directly queries marts tables in BigQuery, ensuring up-to-date insights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qyagojjhzci79crxw4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F68qyagojjhzci79crxw4.png" alt="Dashboard" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strong dashboard implementation. Multiple analytical tiles provide different perspectives for stakeholders, and the visuals are easy to interpret.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Reproducibility &amp;amp; Documentation
&lt;/h2&gt;

&lt;p&gt;Reproducibility is a cornerstone of engineering excellence. This project excels in that area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://readme.md/" rel="noopener noreferrer"&gt;&lt;strong&gt;README.md&lt;/strong&gt;&lt;/a&gt; includes step-by-step instructions for everything—infra setup, data ingestion, dbt transformations, and dashboard connection.&lt;/li&gt;
&lt;li&gt;Sample config variables are provided, and the logical flow is easy to follow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clear, actionable documentation makes this project easy to run and adapt. Excellent work!&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Actionable Feedback &amp;amp; Areas for Growth
&lt;/h2&gt;

&lt;p&gt;Even great projects have room to grow! Here are some opportunities for further improvement and learning:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Warehouse Optimization:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Explicitly document partitioning and clustering strategies in BigQuery marts. Explain how these optimize for cost and performance.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing &amp;amp; CI/CD:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Add dbt tests for data quality (e.g., uniqueness, null checks) and consider adding pipeline-level validation.&lt;/li&gt;
&lt;li&gt;Explore integrating CI/CD (e.g., GitHub Actions) for automated testing and deployment.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Transparency:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Include diagrams or screenshots of Kestra flows in the documentation for better orchestration visibility.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Ingestion (Optional):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;If real-time job data becomes available, consider building a streaming ingestion pipeline to expand the project’s scope.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage dbt Advanced Features:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Utilize the empty &lt;code&gt;macros/&lt;/code&gt;, &lt;code&gt;tests/&lt;/code&gt;, and &lt;code&gt;snapshots/&lt;/code&gt; directories for more advanced dbt features, such as custom logic or snapshotting slowly changing dimensions.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Reviewing and learning from real-world projects is one of the best ways to grow as a data engineer. This project is a fantastic example of a modern, cloud-native data engineering pipeline—well-documented, automated, and designed for actionable analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Modular, testable transformations with dbt are the backbone of maintainable analytics pipelines.&lt;/li&gt;
&lt;li&gt;Clear separation between raw, staging, and marts layers makes analytics scalable and robust.&lt;/li&gt;
&lt;li&gt;Visualization is more than pretty charts—it’s about surfacing real insights for stakeholders.&lt;/li&gt;
&lt;li&gt;Great documentation is as important as great code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What’s next?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you enjoyed this review, try evaluating an open-source project yourself and utilize the learning opportunities provided by &lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;Data Talks Club&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>cloud</category>
      <category>learning</category>
    </item>
    <item>
      <title>Peer Review 3: France Data Engineering Job Market Analysis Pipeline Infra (Part 1)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Fri, 02 May 2025 13:30:01 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-analysis-pipeline-infra-part-1-2ei1</link>
      <guid>https://dev.to/pizofreude/peer-review-3-france-data-engineering-job-market-analysis-pipeline-infra-part-1-2ei1</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the third peer review series for &lt;a href="https://datatalks.club/blog/guide-to-free-online-courses-at-datatalks-club.html#data-engineering-zoomcamp" rel="noopener noreferrer"&gt;DataTalks Club Data Engineering Zoomcamp&lt;/a&gt;. In this post, I’ll be dissecting a real-world data engineering project that analyzes the &lt;a href="https://github.com/aafaf655/DE-Job-Market-Analysis" rel="noopener noreferrer"&gt;French Data Engineering job market&lt;/a&gt;. The goal? To break down the project’s infrastructure, orchestration, and cloud design—spotlighting what works well, what could be improved, and, most importantly, what we can all learn as practicing data engineers.&lt;/p&gt;

&lt;p&gt;Why do this? Because reviewing and sharing feedback on real-world projects sharpens our own skills, encourages open knowledge sharing, and helps us all grow together. Let’s dig in.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Project:&lt;/strong&gt; &lt;a href="https://github.com/aafaf655/DE-Job-Market-Analysis" rel="noopener noreferrer"&gt;DE-Job-Market-Analysis (GitHub)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Build an end-to-end, cloud-native pipeline to collect, store, transform, and visualize Data Engineering job postings for the French market.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key questions addressed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is the demand for Data Engineering roles in France?&lt;/li&gt;
&lt;li&gt;Which skills and tools are most sought after?&lt;/li&gt;
&lt;li&gt;Which companies are hiring, and what are their workforce sizes?&lt;/li&gt;
&lt;li&gt;What are the salary trends and geographic patterns?&lt;/li&gt;
&lt;li&gt;How do job posting trends evolve over time?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Evaluation Criteria: How I Review
&lt;/h2&gt;

&lt;p&gt;For this peer review series, I use a structured rubric inspired by the &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/projects/README.md" rel="noopener noreferrer"&gt;DataTalksClub Data Engineering Zoomcamp&lt;/a&gt; project guidelines. The main areas of focus are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Problem Description&lt;/li&gt;
&lt;li&gt;Cloud Infrastructure (and Infrastructure as Code)&lt;/li&gt;
&lt;li&gt;Data Ingestion &amp;amp; Orchestration&lt;/li&gt;
&lt;li&gt;(Part 2: Transformations, Dashboarding, Reproducibility, and Actionable Feedback)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Problem Description
&lt;/h2&gt;

&lt;p&gt;Right from the start, the project’s README does an excellent job of motivating the work. It clearly explains &lt;strong&gt;why&lt;/strong&gt; understanding the French Data Engineering job market matters, lays out the business context, and lists the specific insights the pipeline aims to deliver.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The project aims to provide valuable insights into the demand for Data Engineering roles, most sought-after skills, key hiring companies, salary trends, locations, and job posting trends over time.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent articulation! The clarity of context and objectives makes it easy for any reader (technical or not) to quickly understand the project’s purpose and value.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Cloud Infrastructure &amp;amp; IaC
&lt;/h2&gt;

&lt;p&gt;This project is cloud-native, leveraging &lt;strong&gt;Google Cloud Platform (GCP)&lt;/strong&gt; as the backbone.&lt;/p&gt;

&lt;p&gt;Key services used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery:&lt;/strong&gt; The analytical data warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Storage (GCS):&lt;/strong&gt; For raw and processed data storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform:&lt;/strong&gt; Infrastructure as Code for reproducible, automated cloud resource provisioning.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Stands Out
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The use of Terraform (with &lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt;) to provision GCS buckets, BigQuery datasets, and service accounts is a mark of maturity—no click-ops here!&lt;/li&gt;
&lt;li&gt;The README provides step-by-step instructions for configuring GCP variables, applying Terraform, and setting up service accounts.&lt;/li&gt;
&lt;li&gt;The infrastructure is cleanly separated into its own directory, making the project modular and easy to maintain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strong implementation of cloud and IaC best practices. The use of Terraform for GCP infra shows a solid grasp of production-grade deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Workflow Orchestration: Batch Pipelines with Kestra
&lt;/h2&gt;

&lt;p&gt;The pipeline’s automation is orchestrated using &lt;strong&gt;Kestra&lt;/strong&gt;, a modern workflow orchestration tool (think: Apache Airflow alternative, but YAML-first and developer-friendly).&lt;/p&gt;

&lt;h3&gt;
  
  
  How It’s Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kestra flows&lt;/strong&gt; automate job posting scraping (using &lt;a href="https://github.com/speedyapply/JobSpy" rel="noopener noreferrer"&gt;JobSpy&lt;/a&gt;), data uploads to GCS, and the triggering of dbt transformations.&lt;/li&gt;
&lt;li&gt;The orchestration logic is defined in YAML files, located in a dedicated &lt;code&gt;kestra/&lt;/code&gt; directory.&lt;/li&gt;
&lt;li&gt;The workflow covers end-to-end batch scheduling: daily scraping, loading, and transformation, ensuring up-to-date analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Batch vs. Streaming
&lt;/h3&gt;

&lt;p&gt;This project focuses exclusively on &lt;strong&gt;batch processing&lt;/strong&gt;—scraping and updating the dataset on a periodic schedule. There’s no streaming ingestion (like Kafka), which is appropriate for the type of data source used here (static job listings).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Great use of Kestra for orchestrating a robust, modular DAG. For future iterations, consider adding diagrams or screenshots of the Kestra flows to make the orchestration even clearer for newcomers.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Data Ingestion: Batch (and What About Streaming?)
&lt;/h2&gt;

&lt;p&gt;The ingestion process is classic batch ETL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scraping&lt;/strong&gt;: Job postings are scraped using an external tool and saved as CSV.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loading&lt;/strong&gt;: CSVs are uploaded to GCS and registered as external tables in BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; All steps are orchestrated via Kestra.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why not streaming?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data source (job boards) doesn’t support real-time feeds, so batch scraping is pragmatic. If live job posting APIs were ever available, a streaming pipeline could be an exciting next step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The batch pipeline is well-automated and fit-for-purpose. The README makes it easy to understand and reproduce the process.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Contents of Interest (Project Structure Highlights)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;dbt/&lt;/code&gt; Directory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Contains a full dbt project (&lt;code&gt;job_market_analysis&lt;/code&gt;) with:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dbt_project.yml&lt;/code&gt; (project config and structure).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;models/&lt;/code&gt; subdirectory with staging, core, and marts models.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;schema.yml&lt;/code&gt; for dbt model/table testing and documentation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;macros/&lt;/code&gt;, &lt;code&gt;tests/&lt;/code&gt;, &lt;code&gt;seeds/&lt;/code&gt;, and &lt;code&gt;snapshots/&lt;/code&gt; folders are present but currently empty.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;kestra/&lt;/code&gt; Directory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;YAML flow definitions for workflow orchestration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;terraform/&lt;/code&gt; Directory&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;main.tf&lt;/code&gt; and &lt;code&gt;variables.tf&lt;/code&gt; for GCP infrastructure provisioning.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;docker-compose.yml&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Used for local orchestration of services.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;images/&lt;/code&gt; Folder&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Includes dashboard screenshots for reference.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Comprehensive, clear, and actionable documentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion &amp;amp; What’s Next
&lt;/h2&gt;

&lt;p&gt;This project demonstrates a strong grasp of modern data engineering infrastructure design: cloud-native, reproducible, and automated. In &lt;strong&gt;Part 2&lt;/strong&gt;, I’ll dive into the transformation layer (dbt), data warehouse design, dashboarding, reproducibility, and provide actionable feedback for the project author—and for all of us as data engineers.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>learning</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Peer Review 2: Data Warehousing, Transformation, and Reproducibility in tfl-data-visualization (Part 2)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Thu, 01 May 2025 16:01:32 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-2-data-warehousing-transformation-and-reproducibility-in-tfl-data-visualization-22pp</link>
      <guid>https://dev.to/pizofreude/peer-review-2-data-warehousing-transformation-and-reproducibility-in-tfl-data-visualization-22pp</guid>
      <description>&lt;p&gt;Welcome to the second part of my peer review series of the &lt;code&gt;tfl-data-visualization&lt;/code&gt; project—a cloud-native data engineering pipeline for &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;analyzing passenger footfall at London Tube and TfL Rail stations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/pizofreude/peer-review-2-tfl-station-footfall-data-analysis-pipeline-part-1-4909"&gt;Part 1&lt;/a&gt;, we explored how the project defines its problem, leverages cloud infrastructure, and orchestrates data ingestion. In this post, we’ll take a closer look at the advanced analytics stages: how the project handles data warehousing, transformation, visualization, and reproducibility. We'll also wrap up with overall feedback and actionable suggestions.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Data Warehouse: BigQuery Partitioning for Scalable Analytics
&lt;/h2&gt;

&lt;p&gt;A robust data warehouse is essential for analytical performance and cost-effectiveness. The project uses &lt;strong&gt;BigQuery&lt;/strong&gt; to store and query processed data. Notably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partitioned Tables:&lt;/strong&gt; The ingestion pipeline consolidates data into native BigQuery tables that are partitioned by travel date. This is a best practice for optimizing query speed and reducing costs in large, time-series datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clear Rationale:&lt;/strong&gt; The README explains how and why partitioning is used, making it easy for reviewers and future maintainers to understand the design choices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible Data:&lt;/strong&gt; Both external and native tables are created, supporting flexible exploration and downstream analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent use of partitioning and cloud-native data warehouse features. If further improvements are desired, consider documenting or implementing clustering strategies for even more efficient queries.&lt;/p&gt;
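&lt;p&gt;As a sketch of what that could look like (table and column names here are illustrative, not taken from the repo), clustering can be added alongside partitioning when the native table is created:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE OR REPLACE TABLE `my_project.tfl.station_footfall`
PARTITION BY travel_date
CLUSTER BY station_name
AS
SELECT *
FROM `my_project.tfl.station_footfall_external`;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;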




&lt;h2&gt;
  
  
  5. Transformations with dbt
&lt;/h2&gt;

&lt;p&gt;Transformations are at the core of any data pipeline. This project uses &lt;strong&gt;dbt&lt;/strong&gt; (data build tool) to structure, document, and automate its transformation logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Modular SQL Models:&lt;/strong&gt; The &lt;code&gt;dbt/models/&lt;/code&gt; directory contains SQL models, with at least one (&lt;code&gt;station_footfall_daily.sql&lt;/code&gt;) handling daily aggregation at the station level (a hypothetical shape is sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Documentation:&lt;/strong&gt; The presence of &lt;code&gt;schema.yml&lt;/code&gt; provides both dbt testing and documentation, ensuring models are validated and well-described.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Execution:&lt;/strong&gt; dbt runs are orchestrated via Kestra, guaranteeing transformations are up-to-date after each ingestion cycle.&lt;/li&gt;
&lt;/ul&gt;
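&lt;p&gt;Since the repository's model is not reproduced here, the sketch below only illustrates the general shape such a daily aggregation might take; all table and column names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- A hypothetical shape for station_footfall_daily.sql
select
    travel_date,
    station_name,
    sum(entry_tap_count)  as total_entries,
    sum(exit_tap_count)   as total_exits
from {{ ref('stg_station_footfall') }}
group by travel_date, station_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;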

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Very strong use of dbt for modular, testable, and automated transformations. The structure supports both maintainability and extensibility. Jinja SQL FTW!&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Dashboarding: Visualizing Insights with Looker Studio
&lt;/h2&gt;

&lt;p&gt;No analytics pipeline is complete without a way to visualize and share insights. The project delivers on this with a &lt;a href="https://lookerstudio.google.com/reporting/33cf406c-c312-4a59-bebd-5d8bf62e0ca6/page/BSfHF" rel="noopener noreferrer"&gt;&lt;strong&gt;Looker Studio dashboard&lt;/strong&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple Interactive Tiles:&lt;/strong&gt; The dashboard includes at least a time series chart and a station ranking table, providing multiple perspectives on the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering &amp;amp; Interactivity:&lt;/strong&gt; Users can filter by dimension (date, station, tap type, etc.), making the dashboard useful for different stakeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessible Online:&lt;/strong&gt; The dashboard is linked and screenshots are provided in the repository for transparency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kih704d1hsxi0htvtra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kih704d1hsxi0htvtra.png" alt="TfL Dashboard" width="800" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent dashboard implementation, with clear, actionable visualizations and interactive filtering that effectively communicate the project’s findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Reproducibility: Clear, Actionable Documentation
&lt;/h2&gt;

&lt;p&gt;Reproducibility is key to collaboration and long-term success. This project excels in this regard:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step-by-Step Instructions:&lt;/strong&gt; The README covers everything from cloud credential setup to infrastructure provisioning, orchestration, and transformation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive Coverage:&lt;/strong&gt; Both local and cloud-based setups are described, making the project accessible to a wide range of users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ease of Use:&lt;/strong&gt; With the provided instructions, anyone with appropriate cloud access can reproduce the results without guesswork.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Great job on ensuring reproducibility. The thorough, actionable documentation is a highlight, lowering the barrier for contributors and reviewers alike.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Summary of Feedback and Recommendations
&lt;/h2&gt;

&lt;p&gt;This project stands out for its clarity, modular design, automation, and modern use of cloud and open-source tools. Here’s a recap of what it does well, and where there’s room to grow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear business problem and motivation.&lt;/li&gt;
&lt;li&gt;Fully automated, cloud-native architecture (GCP, Terraform, Kestra).&lt;/li&gt;
&lt;li&gt;Partitioned BigQuery tables for scalable analytics.&lt;/li&gt;
&lt;li&gt;Modular and reproducible transformations with dbt.&lt;/li&gt;
&lt;li&gt;Interactive, insightful dashboard.&lt;/li&gt;
&lt;li&gt;Excellent documentation and reproducibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Areas for Improvement:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Testing and CI/CD:&lt;/strong&gt; Integrate dbt data tests and consider adding continuous integration for pipeline validation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring &amp;amp; Alerting:&lt;/strong&gt; Add pipeline monitoring and notification mechanisms for production robustness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration Visualization:&lt;/strong&gt; Include screenshots or diagrams of Kestra flows for enhanced documentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse Optimization:&lt;/strong&gt; Consider clustering or additional optimization strategies in BigQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Data:&lt;/strong&gt; If the data source evolves, explore adding real-time streaming capabilities.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;&lt;code&gt;tfl-data-visualization&lt;/code&gt;&lt;/a&gt; project exemplifies the best practices of data engineering: clarity, automation, scalability, and actionable business insights. Peer reviews like this not only celebrate what works but also help teams identify opportunities to make great projects even better. Special thanks &lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;DataTalks Club&lt;/a&gt; for this learning opportunity.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>bigquery</category>
      <category>dbt</category>
    </item>
    <item>
      <title>Peer Review 2: TfL Station Footfall Data Analysis Pipeline (Part 1)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Thu, 01 May 2025 15:50:24 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-2-tfl-station-footfall-data-analysis-pipeline-part-1-4909</link>
      <guid>https://dev.to/pizofreude/peer-review-2-tfl-station-footfall-data-analysis-pipeline-part-1-4909</guid>
      <description>&lt;p&gt;Peer reviews are a cornerstone of building high-quality data engineering projects. They don’t just help catch bugs and inefficiencies—they unlock opportunities for improvement, learning, and robust collaboration. In this two-part series, I’m diving into a peer review of the &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;&lt;code&gt;tfl-data-visualization&lt;/code&gt;&lt;/a&gt; project, which leverages public Transport for London (TfL) Oyster card data to uncover insights about passenger flows across London’s extensive rail network.&lt;/p&gt;

&lt;p&gt;In Part 1, we’ll focus on the project’s foundation: the problem it tackles, its cloud-native architecture, and the orchestration of data ingestion. The goal is to demonstrate how a senior-level data engineering project is structured, documented, and automated for real-world impact.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Overview: What is tfl-data-visualization?
&lt;/h2&gt;

&lt;p&gt;This project is a modern data engineering pipeline designed to analyze footfall data from London Tube and TfL Rail stations. By using &lt;a href="https://crowding.data.tfl.gov.uk/" rel="noopener noreferrer"&gt;open data&lt;/a&gt; on Oyster card tap-ins and tap-outs, the project enables granular analysis of how passengers move through the city’s transport network. The end product is a Looker Studio dashboard powered by data pipelines that automate everything from raw data ingestion to warehouse transformations.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem Description
&lt;/h2&gt;

&lt;p&gt;A well-defined business problem is the first step towards a meaningful solution. This project excels here. The README clearly articulates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The business context:&lt;/strong&gt; Understanding passenger flows can help with optimizing station management, reducing congestion, and supporting infrastructure decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The data source:&lt;/strong&gt; Publicly available TfL Oyster card tap count data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project goals:&lt;/strong&gt; Automate data collection, processing, and visualization to enable data-driven insights for stakeholders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Excellent articulation of the problem and its real-world significance. The clarity helps the reader quickly understand the project’s goals and value.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Cloud Infrastructure and IaC
&lt;/h2&gt;

&lt;p&gt;Modern data engineering projects are built for the cloud, and this project demonstrates that ethos. The pipeline is developed for Google Cloud Platform (GCP), featuring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery as the data warehouse:&lt;/strong&gt; Scalable, cost-efficient, and optimized for analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Storage (GCS) for raw data:&lt;/strong&gt; Centralized, secure cloud storage for source files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure as Code (IaC) with Terraform:&lt;/strong&gt; All GCP resources are provisioned automatically, ensuring repeatability and minimizing manual setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Outstanding use of cloud technologies and automation. Leveraging Terraform for GCP infra shows strong cloud engineering practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Data Ingestion: Batch Processing and Workflow Orchestration
&lt;/h2&gt;

&lt;p&gt;Automation is at the heart of reliability and scalability. The project uses &lt;a href="https://kestra.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;Kestra&lt;/strong&gt;&lt;/a&gt; as its workflow orchestrator, building a robust, automated batch pipeline that covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated downloading&lt;/strong&gt; of multi-year historical CSV data from TfL’s open data portal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uploading files&lt;/strong&gt; to Google Cloud Storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loading data&lt;/strong&gt; into BigQuery, both as external tables and as native tables for partitioned, consolidated analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DAG orchestration:&lt;/strong&gt; Kestra flows define the data ingestion DAG, with scheduled weekly updates and modular subflows for maintainability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Review Comment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Great job with fully automated workflow orchestration. Consider including visual examples or screenshots of Kestra flows to further clarify the orchestration structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up Part 1
&lt;/h2&gt;

&lt;p&gt;In this first part, we’ve established the strong foundation on which the &lt;a href="https://github.com/hbg108/tfl-data-visualization/tree/main" rel="noopener noreferrer"&gt;&lt;code&gt;tfl-data-visualization&lt;/code&gt;&lt;/a&gt; project is built: a clearly defined problem, cloud-native architecture, and automated data ingestion. These elements are critical for any ambitious data engineering initiative, ensuring not only technical excellence but also business relevance and operational scalability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stay tuned for Part 2&lt;/strong&gt;, where we’ll dive deeper into data warehousing strategies, transformation logic with dbt, dashboarding, reproducibility, and overall peer review feedback.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>kestra</category>
    </item>
    <item>
      <title>Peer Review 1: Poland's Real Estate Market Dashboards and Insights with Streamlit (Part 2)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Wed, 30 Apr 2025 16:17:37 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-1-polands-real-estate-market-dashboards-and-insights-with-streamlit-part-2-5eah</link>
      <guid>https://dev.to/pizofreude/peer-review-1-polands-real-estate-market-dashboards-and-insights-with-streamlit-part-2-5eah</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the second part of &lt;strong&gt;Peer Review 1&lt;/strong&gt;, where we continue exploring the data engineering project focused on &lt;a href="https://github.com/elgrassa/Data-engineering-professional-certificate/tree/main" rel="noopener noreferrer"&gt;analyzing Poland's real estate market&lt;/a&gt;. In the &lt;a href="https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d"&gt;first post&lt;/a&gt;, we reviewed the problem description, batch data ingestion pipeline, and cloud setup using &lt;strong&gt;Kestra&lt;/strong&gt;, &lt;strong&gt;BigQuery&lt;/strong&gt;, and &lt;strong&gt;dbt Cloud&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive into the &lt;strong&gt;Streamlit dashboard&lt;/strong&gt;, &lt;strong&gt;data transformations&lt;/strong&gt;, and the &lt;strong&gt;insights&lt;/strong&gt; derived from the project. We'll also discuss future improvements and potential optimizations to enhance the project further.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dashboard Implementation: Streamlit at the Forefront
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Streamlit Overview
&lt;/h3&gt;

&lt;p&gt;Streamlit is used to create a &lt;a href="https://polish-flats-ps.streamlit.app/" rel="noopener noreferrer"&gt;static dashboard&lt;/a&gt; that visualizes data trends and insights. The dashboard provides a clear overview of Poland's real estate market, focusing on rental and sales trends across various cities. It includes features such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visualizations of market trends:&lt;/strong&gt; Median and 95th percentile prices, city-wise activity, and price distributions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forwp910pg3pe2vp8heam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forwp910pg3pe2vp8heam.png" alt="Dashboard 1" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static data integration:&lt;/strong&gt; Pre-processed CSV files are used to power the dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transition to a Dynamic Dashboard (Planned)
&lt;/h3&gt;

&lt;p&gt;While the current implementation is static, the project owner plans to enhance the dashboard by integrating dynamic and interactive features. This would allow users to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Filter data by city, price range, and transaction type (rent/sale).&lt;/li&gt;
&lt;li&gt;Interact with visualizations dynamically to explore trends and insights in real-time.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Data Transformations with dbt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Transforming Raw Data into Insights
&lt;/h3&gt;

&lt;p&gt;The project uses &lt;strong&gt;dbt Cloud&lt;/strong&gt; to transform raw data into analysis-ready tables. These transformations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning and standardizing raw CSV data.&lt;/li&gt;
&lt;li&gt;Aggregating data by city, transaction type, and time period.&lt;/li&gt;
&lt;li&gt;Calculating metrics like median prices, percentiles, and total listings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example SQL Models
&lt;/h3&gt;

&lt;p&gt;Here’s an example of a dbt model that calculates city-level rental price trends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;city_prices&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;transaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_listings&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw_data'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;transaction_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'rent'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transaction_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;percentile_95_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_listings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_rental_listings&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;city_prices&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model produces the aggregated, city-level rental metrics that feed the dashboard, keeping the underlying data well-structured and consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Insights from the Data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Rental and Sales Trends
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-Activity Cities:&lt;/strong&gt; Warsaw and Kraków consistently show higher rental and sales activity compared to smaller cities like Bydgoszcz and Szczecin.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price Distributions:&lt;/strong&gt; Median prices are significantly lower in smaller cities, while the 95th percentile prices indicate luxury market trends in larger cities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bwjd2skvf5mx1ot2yx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bwjd2skvf5mx1ot2yx0.png" alt="Dashboard 2" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Percentile-Based Trends
&lt;/h3&gt;

&lt;p&gt;A line chart in the static dashboard compares median and 95th percentile prices for each city, showing how prices are distributed within each market: the median reflects typical listings, while the 95th percentile captures the upper, luxury end of rental activity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag8fnu6ofo2w2a1s12pf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag8fnu6ofo2w2a1s12pf.png" alt="Dashboard 3" width="800" height="503"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Total Listings vs Average Price&lt;/strong&gt; Comparisons
&lt;/h3&gt;

&lt;p&gt;These charts compare each city's listing activity against its average prices. Average rental prices fluctuate far less than total listings, which are the more volatile metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xlucdzgvmupj3a7qkoz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2xlucdzgvmupj3a7qkoz.png" alt="Dashboard 4" width="800" height="508"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Transition to a Dynamic Dashboard
&lt;/h3&gt;

&lt;p&gt;The planned upgrade to an interactive and dynamic dashboard will provide users with real-time filtering and visualization capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Dynamic Data Updates
&lt;/h3&gt;

&lt;p&gt;Integrating a streaming data pipeline could enable real-time updates for the dashboard, keeping it current with the latest market data.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Advanced Filtering
&lt;/h3&gt;

&lt;p&gt;Adding more advanced filters (e.g., by property type, number of rooms) could enhance the user experience.&lt;/p&gt;
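
&lt;p&gt;As an illustration, a sidebar filter in Streamlit could look roughly like the sketch below. This is only a hypothetical snippet: the CSV export and the &lt;code&gt;property_type&lt;/code&gt; and &lt;code&gt;rooms&lt;/code&gt; column names are assumptions and would need to match the curated tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
import streamlit as st

# Hypothetical export of the curated listings table
df = pd.read_csv("curated_listings.csv")

# Sidebar filters for city, property type, and number of rooms
cities = st.sidebar.multiselect("City", sorted(df["city"].unique()))
property_type = st.sidebar.selectbox("Property type", ["all"] + sorted(df["property_type"].unique()))
rooms = st.sidebar.slider("Rooms", 1, 6, (1, 3))

# Apply the selected filters to the DataFrame shown in the dashboard
filtered = df[df["rooms"].between(rooms[0], rooms[1])]
if cities:
    filtered = filtered[filtered["city"].isin(cities)]
if property_type != "all":
    filtered = filtered[filtered["property_type"] == property_type]

st.dataframe(filtered)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;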

&lt;h3&gt;
  
  
  4. Predictive Analytics
&lt;/h3&gt;

&lt;p&gt;Incorporating time-series forecasting models could provide users with future price trends and market predictions.&lt;/p&gt;
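
&lt;p&gt;For example, a first pass at forecasting median rental prices could fit a simple ARIMA model with statsmodels. This is only a sketch, and the input file and column names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly median rental prices for a single city
series = pd.read_csv(
    "warsaw_monthly_median.csv", parse_dates=["month"], index_col="month"
)["median_price"]

# Fit a simple ARIMA(1, 1, 1) model and forecast the next six months
model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;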

&lt;h3&gt;
  
  
  5. Optimization in BigQuery
&lt;/h3&gt;

&lt;p&gt;Partitioning and clustering the BigQuery tables could significantly improve query performance for larger datasets.&lt;/p&gt;
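
&lt;p&gt;For instance, using the google-cloud-bigquery client, a partitioned and clustered table could be created roughly as follows; the project, dataset, and column names are placeholders rather than the project's actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table ID and schema for the listings data
table = bigquery.Table("my-project.real_estate.listings")
table.schema = [
    bigquery.SchemaField("listing_date", "DATE"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("transaction_type", "STRING"),
    bigquery.SchemaField("price", "NUMERIC"),
]

# Partition by date and cluster on the columns most queries filter by
table.time_partitioning = bigquery.TimePartitioning(field="listing_date")
table.clustering_fields = ["city", "transaction_type"]

client.create_table(table, exists_ok=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;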




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post reviewed the current static Streamlit dashboard, data transformations with dbt, and key insights derived from the project. While the static dashboard provides clear visualizations, the planned interactive upgrade will make it more dynamic and user-friendly. Additionally, future improvements like real-time updates and predictive analytics can further enhance the project's impact.&lt;/p&gt;




&lt;h3&gt;
  
  
  Related Posts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d"&gt;Peer Review 1: Analyzing Poland's Real Estate Market (Part 1)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Thanks to &lt;a href="https://datatalks.club/" rel="noopener noreferrer"&gt;DataTalks.Club&lt;/a&gt; for this learning opportunity.
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>streamlit</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Peer Review 1: Analyzing Poland's Real Estate Market (Part 1)</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Wed, 30 Apr 2025 15:42:28 +0000</pubDate>
      <link>https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d</link>
      <guid>https://dev.to/pizofreude/peer-review-1-analyzing-polands-real-estate-market-part-1-2c6d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Welcome to the first part of &lt;strong&gt;Peer Review 1&lt;/strong&gt; for &lt;a href="https://github.com/DataTalksClub/data-engineering-zoomcamp" rel="noopener noreferrer"&gt;DTC DEZOOMCAMP&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt; This two-part series provides an in-depth review of a data engineering pipeline designed to analyze &lt;a href="https://github.com/elgrassa/Data-engineering-professional-certificate/tree/main" rel="noopener noreferrer"&gt;Poland's real estate market&lt;/a&gt;. The project demonstrates the use of modern data engineering tools such as &lt;strong&gt;BigQuery&lt;/strong&gt;, &lt;strong&gt;dbt Cloud&lt;/strong&gt;, and &lt;strong&gt;Kestra&lt;/strong&gt;, along with a &lt;strong&gt;Streamlit&lt;/strong&gt; dashboard for visualization.&lt;/p&gt;

&lt;p&gt;This post will focus on the &lt;strong&gt;problem description&lt;/strong&gt;, &lt;strong&gt;data ingestion pipeline&lt;/strong&gt;, and the &lt;strong&gt;cloud setup&lt;/strong&gt;, while the next post will explore the interactive dashboard and insights.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem Description
&lt;/h2&gt;

&lt;p&gt;The project aims to analyze Poland's real estate market, focusing on rental and sales trends across various cities. By processing and visualizing the data, the following questions are addressed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Which cities have the highest rental or sales activity?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What are the price trends across different cities?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How does the real estate market vary between rentals and sales?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A &lt;a href="https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland" rel="noopener noreferrer"&gt;dataset&lt;/a&gt; from Kaggle, containing apartment prices in Poland, serves as the starting point. This dataset includes details such as city names, transaction types (rent/sale), and prices. The primary challenge lies in transforming the raw CSV data into actionable insights while ensuring scalability and reproducibility.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Ingestion: Batch Processing with Kestra
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workflow Orchestration
&lt;/h3&gt;

&lt;p&gt;The project employs &lt;strong&gt;Kestra&lt;/strong&gt; for handling multiple CSV files and automating the ETL process. The workflow includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Extraction:&lt;/strong&gt; CSV files containing raw real estate data are ingested into the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transformation:&lt;/strong&gt; Kestra facilitates cleaning and structuring the data for analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Loading:&lt;/strong&gt; The cleaned data is loaded into both &lt;strong&gt;PostgreSQL&lt;/strong&gt; (for local analysis) and &lt;strong&gt;BigQuery&lt;/strong&gt; (for cloud-based analysis).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why Kestra?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://kestra.io/" rel="noopener noreferrer"&gt;Kestra&lt;/a&gt; provides the ability to automate the entire ETL process, ensuring consistency and minimizing manual intervention. Although the dataset isn’t updated regularly, the pipeline is scalable and can handle new data efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Kestra Flow
&lt;/h3&gt;

&lt;p&gt;An example Kestra flow processes the CSV files by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Taking file paths and metadata (e.g., month and year) as input.&lt;/li&gt;
&lt;li&gt;Executing tasks for data cleaning, validation, and loading (a Python sketch of this step follows the list).&lt;/li&gt;
&lt;li&gt;Producing cleaned data as output in BigQuery and PostgreSQL.&lt;/li&gt;
&lt;/ul&gt;
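
&lt;p&gt;As a rough illustration of what the cleaning-and-loading step does, a Python task along these lines would read a monthly CSV, clean it, and append it to PostgreSQL; the file path, table name, and connection string below are hypothetical and not taken from the reviewed repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

def clean_and_load(csv_path, month, year):
    # Read the raw listings file for the given month
    df = pd.read_csv(csv_path)

    # Basic cleaning: drop rows without a price and normalise city names
    df = df.dropna(subset=["price"])
    df["city"] = df["city"].str.strip().str.title()
    df["month"], df["year"] = month, year

    # Append the cleaned data to PostgreSQL (a similar load targets BigQuery)
    engine = create_engine("postgresql://user:password@localhost:5432/real_estate")
    df.to_sql("raw_data", engine, if_exists="append", index=False)

clean_and_load("apartments_2024_01.csv", month=1, year=2024)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;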




&lt;h2&gt;
  
  
  Cloud Setup: BigQuery and dbt Cloud
&lt;/h2&gt;

&lt;h3&gt;
  
  
  BigQuery as the Data Warehouse
&lt;/h3&gt;

&lt;p&gt;BigQuery serves as the data warehouse for storing and querying the transformed data. Its serverless architecture and scalability make it an excellent choice. Key features utilized include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Queries:&lt;/strong&gt; Used to analyze price distributions, trends, and city-level activity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with dbt Cloud:&lt;/strong&gt; Enables modular and reusable transformations.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Transformations with dbt Cloud
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;dbt Cloud&lt;/strong&gt; is employed for data cleaning and structuring. It allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing modular SQL models.&lt;/li&gt;
&lt;li&gt;Testing data integrity.&lt;/li&gt;
&lt;li&gt;Creating curated tables with calculated fields like medians, percentiles, and trends.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example dbt Configuration
&lt;/h3&gt;

&lt;p&gt;Below is a snippet from the &lt;code&gt;dbt_project.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;polish_flats_dbt'&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1.0'&lt;/span&gt;
&lt;span class="na"&gt;config-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default'&lt;/span&gt;  &lt;span class="c1"&gt;# Use the default profile from profiles.yml&lt;/span&gt;
&lt;span class="na"&gt;model-paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;models&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Challenges and Workarounds
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Challenge:&lt;/strong&gt; Streamlit occasionally failed due to sync delays from the US cluster of dbt Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workaround:&lt;/strong&gt; Pre-exported CSVs were used for local analysis, significantly improving performance and reliability.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;The README file provides detailed instructions for setting up the project locally. These include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Setting up &lt;strong&gt;PostgreSQL&lt;/strong&gt; and &lt;strong&gt;Kestra&lt;/strong&gt; using Docker.&lt;/li&gt;
&lt;li&gt;Installing dependencies for dbt and running transformations.&lt;/li&gt;
&lt;li&gt;Configuring BigQuery and dbt Cloud for seamless integration.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Running Locally
&lt;/h3&gt;

&lt;p&gt;Follow these steps to run the pipeline locally:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Clone the repository:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;https://github.com/elgrassa/Data-engineering-professional-certificate.git&amp;gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;Data-engineering-professional-certificate

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start PostgreSQL and Kestra using Docker:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-p&lt;/span&gt; kestra-postgres up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install dependencies:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-bigquery

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post reviewed the problem description, batch data ingestion pipeline with Kestra, and the cloud setup using BigQuery and dbt Cloud. These components form the backbone of the project, enabling efficient ETL processes and scalable storage.&lt;/p&gt;

&lt;p&gt;The next post will delve into the &lt;strong&gt;Streamlit dashboard&lt;/strong&gt;, &lt;strong&gt;visualizations&lt;/strong&gt;, and &lt;strong&gt;insights&lt;/strong&gt; derived from the data.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>kestra</category>
    </item>
    <item>
      <title>InsightFlow Part 9: Workflow Orchestration with Kestra</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 07:31:55 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-9-workflow-orchestration-with-kestra-3cge</link>
      <guid>https://dev.to/pizofreude/insightflow-part-9-workflow-orchestration-with-kestra-3cge</guid>
      <description>&lt;h1&gt;
  
  
  9. Workflow Orchestration with Kestra
&lt;/h1&gt;

&lt;p&gt;In modern data engineering, orchestrating workflows is a critical component of building reliable, scalable, and automated data pipelines. For the &lt;strong&gt;InsightFlow&lt;/strong&gt; project, we leverage &lt;strong&gt;Kestra&lt;/strong&gt;, an open-source declarative orchestration platform, to manage the end-to-end workflow of ingesting, transforming, and analyzing retail and economic data from public sources. This blog post will walk you through how Kestra is used in this project and why it is an excellent choice for workflow orchestration.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Kestra?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kestra is a modern orchestration platform designed to simplify the management of complex workflows. It offers several features that make it ideal for the InsightFlow project:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Declarative Workflow Design&lt;/strong&gt;: Workflows are defined in YAML, making them easy to read, version-control, and maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Kestra can handle large-scale workflows with hundreds of tasks, ensuring reliability even under heavy loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: With over 600 plugins, Kestra supports a wide range of tasks, including AWS services, database queries, and custom scripts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Kestra provides detailed logs, metrics, and monitoring tools to track workflow execution and troubleshoot issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with Modern Tools&lt;/strong&gt;: Kestra integrates seamlessly with Git, Terraform, and other tools, enabling a streamlined CI/CD pipeline.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Kestra in the InsightFlow Project&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the InsightFlow project, Kestra orchestrates the following key workflows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Ingestion&lt;/strong&gt;: Fetching raw data from public sources using AWS Batch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Transformation&lt;/strong&gt;: Running dbt models to clean, normalize, and structure the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Cataloging&lt;/strong&gt;: Updating the AWS Glue Data Catalog to reflect the latest data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing and Validation&lt;/strong&gt;: Running dbt tests to ensure data quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduling and Automation&lt;/strong&gt;: Automating the entire pipeline to run on a daily schedule.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Workflow Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Kestra workflow for the production environment is defined in &lt;code&gt;kestra/flows/insightflow_prod_pipeline.yml&lt;/code&gt;. Below is an overview of the key tasks:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Data Ingestion via AWS Batch&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The workflow starts by submitting an AWS Batch job to ingest raw data from public sources into the S3 bucket &lt;code&gt;insightflow-prod-raw-data&lt;/code&gt;. This is achieved using the following task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;submit_batch_ingestion_job_cli&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.tasks.scripts.Bash&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;echo "Submitting AWS Batch Job..."&lt;/span&gt;
      &lt;span class="s"&gt;JOB_DEF_NAME="insightflow-prod-ingestion-job-def"&lt;/span&gt;
      &lt;span class="s"&gt;JOB_QUEUE_NAME="insightflow-prod-job-queue"&lt;/span&gt;
      &lt;span class="s"&gt;TARGET_BUCKET_NAME="insightflow-prod-raw-data"&lt;/span&gt;
      &lt;span class="s"&gt;AWS_REGION="ap-southeast-2"&lt;/span&gt;

      &lt;span class="s"&gt;JOB_NAME="insightflow-ingestion-{{execution.id}}"&lt;/span&gt;
      &lt;span class="s"&gt;JOB_OUTPUT=$(aws batch submit-job \\&lt;/span&gt;
        &lt;span class="s"&gt;--region "$AWS_REGION" \\&lt;/span&gt;
        &lt;span class="s"&gt;--job-name "$JOB_NAME" \\&lt;/span&gt;
        &lt;span class="s"&gt;--job-queue "$JOB_QUEUE_NAME" \\&lt;/span&gt;
        &lt;span class="s"&gt;--job-definition "$JOB_DEF_NAME" \\&lt;/span&gt;
        &lt;span class="s"&gt;--container-overrides '{&lt;/span&gt;
            &lt;span class="s"&gt;"environment": [&lt;/span&gt;
              &lt;span class="s"&gt;{"name": "TARGET_BUCKET", "value": "'"$TARGET_BUCKET_NAME"'"}&lt;/span&gt;
            &lt;span class="s"&gt;]&lt;/span&gt;
          &lt;span class="s"&gt;}')&lt;/span&gt;

      &lt;span class="s"&gt;JOB_ID=$(echo "$JOB_OUTPUT" | grep -o '"jobId": "[^"]*' | awk -F'"' '{print $4}')&lt;/span&gt;
      &lt;span class="s"&gt;echo "Submitted Job ID: $JOB_ID"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Updating the Glue Data Catalog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the raw data is ingested, the workflow triggers an AWS Glue Crawler to update the Glue Data Catalog. This ensures that the latest data is available for querying in Athena.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;start_glue_crawler_cli&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.core.tasks.scripts.Bash&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;echo "Starting AWS Glue Crawler..."&lt;/span&gt;
      &lt;span class="s"&gt;CRAWLER_NAME="insightflow-prod-raw-data-crawler"&lt;/span&gt;
      &lt;span class="s"&gt;AWS_REGION="ap-southeast-2"&lt;/span&gt;

      &lt;span class="s"&gt;aws glue start-crawler --region $AWS_REGION --name "$CRAWLER_NAME"&lt;/span&gt;
      &lt;span class="s"&gt;echo "Crawler $CRAWLER_NAME started."&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Running dbt Models&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the data is cataloged, the workflow runs dbt models to transform the raw data into an analysis-ready format. This includes tasks for syncing dbt files, installing dependencies, and running the models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt_run&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.dbt.cli.DbtCLI&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;dbt run --target prod&lt;/span&gt;
    &lt;span class="na"&gt;namespaceFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;containerImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pizofreude/kestra-dbt-athena:latest&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;4. Testing and Validation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure data quality, the workflow runs dbt tests on the transformed data. Any issues are logged for further investigation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dbt_test&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.dbt.cli.DbtCLI&lt;/span&gt;
  &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;dbt test --target prod&lt;/span&gt;
    &lt;span class="na"&gt;namespaceFiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;containerImage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pizofreude/kestra-dbt-athena:latest&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Scheduling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The workflow is scheduled to run daily at 5:00 AM UTC using Kestra's scheduling feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily_schedule&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;io.kestra.plugin.core.trigger.Schedule&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Benefits of Using Kestra&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Kestra automates the entire pipeline, reducing manual intervention and ensuring consistency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling&lt;/strong&gt;: With built-in retry mechanisms and detailed logs, Kestra makes it easy to identify and resolve issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Kestra can handle large-scale workflows with multiple tasks and dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: The declarative YAML syntax allows for easy customization and extension of workflows.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Getting Started with Kestra&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To set up Kestra for your own projects, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install Kestra&lt;/strong&gt;: Refer to the &lt;a href="https://www.kestra.io/docs/" rel="noopener noreferrer"&gt;Kestra documentation&lt;/a&gt; for installation instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define Workflows&lt;/strong&gt;: Create YAML files to define your workflows, as shown in the examples above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run Workflows&lt;/strong&gt;: Use the Kestra UI or CLI to execute and monitor your workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrate with CI/CD&lt;/strong&gt;: Use Git and Terraform to version-control and deploy your workflows.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Kestra is a powerful tool for orchestrating workflows in modern data pipelines. In the InsightFlow project, it plays a crucial role in automating the ingestion, transformation, and validation of retail and economic data. By leveraging Kestra's features, we ensure that the pipeline is reliable, scalable, and easy to maintain.&lt;/p&gt;

&lt;p&gt;If you're building a similar project, consider using Kestra to simplify your workflow orchestration. For more details, check out the &lt;a href="https://www.kestra.io/docs/" rel="noopener noreferrer"&gt;Kestra documentation&lt;/a&gt; or explore the &lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy orchestrating!&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>orchestration</category>
      <category>kestra</category>
    </item>
    <item>
      <title>InsightFlow Part 8: Setting Up AWS Athena for Data Analysis in InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 03:27:07 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-8-setting-up-aws-athena-for-data-analysis-in-insightflow-4aoo</link>
      <guid>https://dev.to/pizofreude/insightflow-part-8-setting-up-aws-athena-for-data-analysis-in-insightflow-4aoo</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how &lt;strong&gt;Amazon Athena&lt;/strong&gt; was set up for querying and analyzing data in the &lt;strong&gt;InsightFlow&lt;/strong&gt; project. Athena is a serverless, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It’s an essential component of the InsightFlow pipeline, enabling efficient querying of both raw and transformed data.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Amazon Athena?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Amazon Athena is an ideal choice for InsightFlow due to its:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Architecture&lt;/strong&gt;: No infrastructure to manage; you only pay for the queries you run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Integration with S3&lt;/strong&gt;: Queries data directly from S3 without requiring ETL processes to move data elsewhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for Open Formats&lt;/strong&gt;: Works with Parquet, ORC, JSON, and other formats, ensuring compatibility with the data pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning and Optimization&lt;/strong&gt;: Supports partitioning and compression to reduce query costs and improve performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For InsightFlow, Athena is used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query raw and processed data stored in S3.&lt;/li&gt;
&lt;li&gt;Analyze trends in retail sales and fuel prices.&lt;/li&gt;
&lt;li&gt;Serve as the backend for dashboards in AWS QuickSight.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Preparing the Data in S3&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The data pipeline stores both raw and transformed data in S3 buckets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raw Data&lt;/strong&gt;: Stored in the &lt;code&gt;insightflow-prod-raw-data&lt;/code&gt; bucket under the &lt;code&gt;raw/&lt;/code&gt; prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processed Data&lt;/strong&gt;: Stored in the &lt;code&gt;insightflow-prod-processed-data&lt;/code&gt; bucket under the &lt;code&gt;processed/&lt;/code&gt; prefix.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partitioning the Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To optimize query performance, the processed data is partitioned by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt;. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://insightflow-prod-processed-data/fct_retail_sales_monthly/year=2025/month=04/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitioning allows Athena to scan only the relevant data, reducing query costs and improving performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Setting Up the Glue Data Catalog&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Athena relies on the &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; to store metadata about the datasets. Glue Crawlers were used to automatically discover schemas and populate the Data Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Glue Crawler Configuration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The Glue Crawler scans the &lt;code&gt;processed&lt;/code&gt; S3 bucket and creates tables in the Glue database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_crawler"&lt;/span&gt; &lt;span class="s2"&gt;"processed_data_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow-prod-processed-data-crawler"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;glue_crawler_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;database_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_glue_catalog_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbt_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;s3_target&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://insightflow-prod-processed-data/processed/"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;schema_change_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;update_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"UPDATE_IN_DATABASE"&lt;/span&gt;
    &lt;span class="nx"&gt;delete_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"LOG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the crawler is run, the processed data is available as tables in the Glue Data Catalog.&lt;/p&gt;
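
&lt;p&gt;For example, the crawler can be re-run and the resulting catalog tables inspected with boto3; the Glue database name below is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

glue = boto3.client("glue", region_name="ap-southeast-2")

# Re-run the crawler after new processed data lands in S3
glue.start_crawler(Name="insightflow-prod-processed-data-crawler")

# List the tables the crawler registered (database name is hypothetical)
response = glue.get_tables(DatabaseName="insightflow_prod_db")
for table in response["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;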




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Configuring Athena&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Creating a Workgroup&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Athena Workgroups help manage query costs and monitor usage. A workgroup was created for the project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_athena_workgroup"&lt;/span&gt; &lt;span class="s2"&gt;"insightflow_workgroup"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow-prod-workgroup"&lt;/span&gt;

  &lt;span class="nx"&gt;configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enforce_workgroup_configuration&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;publish_cloudwatch_metrics_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;result_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;output_location&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://insightflow-prod-processed-data/athena-results/"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Setting Query Results Location&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Athena query results are stored in the &lt;code&gt;athena-results/&lt;/code&gt; prefix of the &lt;code&gt;processed&lt;/code&gt; S3 bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://insightflow-prod-processed-data/athena-results/

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that query results are accessible for debugging and downstream processing.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Querying Data with Athena&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once the Glue Crawler has populated the Data Catalog, the data can be queried using SQL in the Athena console or programmatically via the AWS CLI or SDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Query: Analyzing Retail Sales and Fuel Prices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The following query analyzes the correlation between retail sales and fuel prices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron95_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron97_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_diesel_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;fct_retail_sales_monthly&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;
    &lt;span class="n"&gt;fuelprice_monthly&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2025&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query joins the &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt; fact table with the &lt;code&gt;fuelprice_monthly&lt;/code&gt; table to analyze trends.&lt;/p&gt;
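
&lt;p&gt;As noted above, queries can also be submitted programmatically via the SDK. Below is a minimal boto3 sketch that runs a trimmed version of the query and prints the results; the database name is an assumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

athena = boto3.client("athena", region_name="ap-southeast-2")

query = """
SELECT year, month, sales_value_rm_mil
FROM fct_retail_sales_monthly
WHERE year = 2025
ORDER BY month
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "insightflow_prod_db"},  # hypothetical database name
    WorkGroup="insightflow-prod-workgroup",
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

# Print each returned row (the first row contains the column headers)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;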




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Optimizing Athena Queries&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Use Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Partitioning the data by &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;month&lt;/code&gt; ensures that Athena scans only the relevant partitions, reducing query costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Use Parquet Format&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data is stored in Parquet format, which is optimized for analytical queries due to its columnar storage and compression.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Limit Data Scanned&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;SELECT&lt;/code&gt; statements to query only the required columns and apply filters (e.g., &lt;code&gt;WHERE year = 2025&lt;/code&gt;) to minimize the amount of data scanned.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Monitor Query Costs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Athena Workgroups provide metrics in CloudWatch to monitor query costs and performance.&lt;/p&gt;
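
&lt;p&gt;For example, the bytes scanned by the workgroup can be pulled from CloudWatch with boto3. This sketch assumes the workgroup publishes the standard &lt;code&gt;AWS/Athena&lt;/code&gt; &lt;code&gt;ProcessedBytes&lt;/code&gt; metric, as enabled in the Terraform configuration above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-southeast-2")

# Daily bytes scanned by the workgroup over the last seven days
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Athena",
    MetricName="ProcessedBytes",
    Dimensions=[{"Name": "WorkGroup", "Value": "insightflow-prod-workgroup"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), round(point["Sum"] / 1e9, 2), "GB scanned")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;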




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 6: Integrating Athena with QuickSight&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Athena serves as the backend for dashboards in AWS QuickSight. QuickSight connects to Athena using the Glue Data Catalog, enabling interactive visualizations of retail sales and fuel price trends.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges and Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Managing schema changes in Glue required careful configuration of the &lt;code&gt;schema_change_policy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning Strategy&lt;/strong&gt;: Choosing the right partitioning strategy was critical for optimizing query performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Management&lt;/strong&gt;: Monitoring query costs in Athena Workgroups helped identify and optimize expensive queries.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Amazon Athena is a powerful tool for querying and analyzing data directly in S3. By integrating Athena with the Glue Data Catalog and optimizing the data layout, InsightFlow enables efficient, cost-effective data analysis.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>aws</category>
      <category>analytics</category>
    </item>
    <item>
      <title>InsightFlow Part 7: Data Quality Implementation &amp; Best Practices for InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 03:04:01 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-7-data-quality-implementation-best-practices-for-insightflow-27cp</link>
      <guid>https://dev.to/pizofreude/insightflow-part-7-data-quality-implementation-best-practices-for-insightflow-27cp</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how &lt;strong&gt;data quality&lt;/strong&gt; was implemented in the InsightFlow project and share best practices for ensuring reliable and accurate data pipelines. Data quality is a critical aspect of any data engineering project, as it ensures that the insights derived from the data are trustworthy and actionable.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Data Quality Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data quality directly impacts the reliability of analytics and decision-making. Poor data quality can lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inaccurate Insights&lt;/strong&gt;: Misleading trends and correlations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Inefficiencies&lt;/strong&gt;: Wasted time debugging and fixing issues downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss of Trust&lt;/strong&gt;: Stakeholders losing confidence in the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For InsightFlow, ensuring data quality was essential to accurately analyze retail sales trends and their correlation with fuel prices.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Quality Framework for InsightFlow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The data quality framework for InsightFlow was implemented at multiple stages of the pipeline, from ingestion to transformation and analysis. Below are the key components:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Data Validation During Ingestion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The ingestion layer, implemented using &lt;strong&gt;AWS Batch&lt;/strong&gt;, includes basic validation checks to ensure the raw data meets expected formats and structures.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Validation Steps&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File Format Validation&lt;/strong&gt;: Ensures that ingested files are in the expected format (e.g., Parquet or CSV).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Validation&lt;/strong&gt;: Confirms that the files contain the required columns with the correct data types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Null Checks&lt;/strong&gt;: Flags missing or null values in critical columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example: Python Validation Script&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required_columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;missing_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;required_columns&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing required columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing_columns&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Null values detected in the dataset.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
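
&lt;p&gt;For instance, the ingestion job might call this check on a freshly downloaded file like so; the file path and column names are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Hypothetical raw file and required columns for the fuel price dataset
df = pd.read_parquet("raw/fuelprice_2025_04.parquet")
validate_data(df, required_columns=["date", "ron95", "ron97", "diesel"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;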






&lt;h3&gt;
  
  
  &lt;strong&gt;2. Data Quality in Transformation (dbt)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The transformation layer, implemented using &lt;strong&gt;dbt&lt;/strong&gt;, includes robust data quality checks through &lt;strong&gt;schema tests&lt;/strong&gt; and &lt;strong&gt;custom tests&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Schema Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Schema tests ensure that the data adheres to predefined rules. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not Null&lt;/strong&gt;: Ensures critical columns (e.g., &lt;code&gt;sales_value_rm_mil&lt;/code&gt;) are not null.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique&lt;/strong&gt;: Ensures unique values in primary key columns (e.g., &lt;code&gt;date_key&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relationships&lt;/strong&gt;: Validates foreign key relationships between fact and dimension tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example: Schema Test for &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt;&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fct_retail_sales_monthly&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fact&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;combining&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;retail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;average&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fuel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prices."&lt;/span&gt;
    &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;date_key&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Foreign&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;dimension."&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;relationships&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ref('dim_date')&lt;/span&gt;
              &lt;span class="na"&gt;field&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;date_key&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sales_value_rm_mil&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;RM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;millions."&lt;/span&gt;
        &lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;dbt_utils.expression_is_true&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Custom Tests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Custom tests were implemented using the &lt;code&gt;dbt-utils&lt;/code&gt; package to validate business-specific rules. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price Range Validation&lt;/strong&gt;: Ensures fuel prices are within a reasonable range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume Index Validation&lt;/strong&gt;: Ensures volume indices are non-negative.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Monitoring and Alerts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure ongoing data quality, monitoring and alerting mechanisms were implemented using &lt;strong&gt;CloudWatch&lt;/strong&gt; and &lt;strong&gt;Kestra&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;CloudWatch Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Glue Crawler Logs&lt;/strong&gt;: Monitors schema changes and ingestion errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Athena Query Logs&lt;/strong&gt;: Tracks query performance and errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Kestra Workflow Alerts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kestra workflows include error handling and notifications for failed tasks. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If a Glue Crawler fails, an alert is sent to the team via email or Slack.&lt;/li&gt;
&lt;li&gt;If a dbt test fails, the pipeline halts, and the issue is logged for debugging.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Best Practices for Data Quality&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Define Clear Data Quality Rules&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Collaborate with stakeholders to define rules for each dataset (e.g., required columns, valid ranges).&lt;/li&gt;
&lt;li&gt;Document these rules in a central repository for easy reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Automate Data Quality Checks&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use tools like dbt to automate schema and custom tests.&lt;/li&gt;
&lt;li&gt;Integrate validation scripts into the ingestion pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Monitor Data Quality Continuously&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Set up dashboards to monitor key metrics (e.g., null values, schema changes).&lt;/li&gt;
&lt;li&gt;Use alerts to notify the team of issues in real time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Handle Data Quality Issues Proactively&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Implement retry mechanisms for transient errors (e.g., network issues during ingestion).&lt;/li&gt;
&lt;li&gt;Log all data quality issues for auditing and debugging.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Test Data Quality Regularly&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Schedule regular tests to ensure data quality rules are enforced.&lt;/li&gt;
&lt;li&gt;Use historical data to validate new rules and identify anomalies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges and Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Managing schema changes in Glue required careful configuration of the &lt;code&gt;schema_change_policy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Tests&lt;/strong&gt;: Writing custom tests for business-specific rules required collaboration with domain experts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert Fatigue&lt;/strong&gt;: Fine-tuning alerts was necessary to avoid overwhelming the team with non-critical notifications.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Implementing robust data quality practices is essential for building reliable data pipelines. By integrating validation checks, schema tests, and monitoring mechanisms, InsightFlow ensures that its data is accurate, consistent, and trustworthy. These practices not only improve the quality of insights but also build confidence among stakeholders.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>dbt</category>
    </item>
    <item>
      <title>InsightFlow Part 6: Implementing ETL Processes with AWS Glue for InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 02:44:42 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-6-implementing-etl-processes-with-aws-glue-for-insightflow-1gaj</link>
      <guid>https://dev.to/pizofreude/insightflow-part-6-implementing-etl-processes-with-aws-glue-for-insightflow-1gaj</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll explore how &lt;strong&gt;AWS Glue&lt;/strong&gt; was used to implement the &lt;strong&gt;ETL (Extract, Transform, Load)&lt;/strong&gt; processes for the &lt;strong&gt;InsightFlow&lt;/strong&gt; project. AWS Glue provides a serverless, fully managed environment for building and running ETL pipelines, making it an ideal choice for transforming raw data into a structured, queryable format.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why AWS Glue?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;AWS Glue simplifies the process of building ETL pipelines by offering the following key features:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Serverless Architecture&lt;/strong&gt;: No need to manage infrastructure; Glue automatically provisions resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Discovery&lt;/strong&gt;: Automatically detects and catalogs data schemas using the &lt;strong&gt;Glue Data Catalog&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with AWS Services&lt;/strong&gt;: Seamlessly integrates with S3, Athena, and other AWS services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Automatically scales to handle large datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency&lt;/strong&gt;: Pay only for the resources used during ETL jobs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For InsightFlow, AWS Glue was used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover and catalog raw data stored in S3.&lt;/li&gt;
&lt;li&gt;Transform raw data into a structured format.&lt;/li&gt;
&lt;li&gt;Load the transformed data into a partitioned data warehouse layer in S3.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Overview of the ETL Workflow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The ETL process in InsightFlow involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt;: Fetch raw data from S3 buckets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: Clean, normalize, and enrich the data using Glue jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Write the transformed data back to S3 in a partitioned format for efficient querying with Athena.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Components&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Glue Data Catalog&lt;/strong&gt;: Stores metadata about the raw and transformed datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Crawlers&lt;/strong&gt;: Automatically discover schemas and update the Data Catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue Jobs&lt;/strong&gt;: Perform the actual data transformations.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Setting Up the Glue Data Catalog&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Glue Data Catalog&lt;/strong&gt; acts as a central repository for metadata about the datasets. It enables Athena to query the data without requiring explicit schema definitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Defining the Glue Database&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Glue database was created to organize the tables for the project. Here’s the Terraform configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_catalog_database"&lt;/span&gt; &lt;span class="s2"&gt;"dbt_database"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow_prod"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Discovering Data with Glue Crawlers&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Glue Crawlers were used to automatically discover the schema of raw data stored in S3 and populate the Data Catalog.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Configuring the Glue Crawler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The crawler scans the &lt;code&gt;raw&lt;/code&gt; S3 bucket and creates tables in the Glue database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_glue_crawler"&lt;/span&gt; &lt;span class="s2"&gt;"raw_data_crawler"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"insightflow-prod-raw-data-crawler"&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;glue_crawler_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;database_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_glue_catalog_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dbt_database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;

  &lt;span class="nx"&gt;s3_target&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3://insightflow-prod-raw-data/raw/"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;schema_change_policy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;update_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"UPDATE_IN_DATABASE"&lt;/span&gt;
    &lt;span class="nx"&gt;delete_behavior&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"LOG"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"prod"&lt;/span&gt;
    &lt;span class="nx"&gt;Project&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"InsightFlow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Running the Crawler&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The crawler is triggered using the AWS CLI or programmatically via the AWS SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws glue start-crawler &lt;span class="nt"&gt;--name&lt;/span&gt; insightflow-prod-raw-data-crawler

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
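
&lt;p&gt;For the programmatic route, here's a minimal boto3 sketch (assuming credentials and region are already configured):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

glue = boto3.client("glue")

# Start the crawler, then poll until it returns to the READY state.
glue.start_crawler(Name="insightflow-prod-raw-data-crawler")

while True:
    state = glue.get_crawler(Name="insightflow-prod-raw-data-crawler")["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;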



&lt;p&gt;Once the crawler completes, the raw data is available as tables in the Glue Data Catalog.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Transforming Data with Glue Jobs&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Glue jobs were used to clean, normalize, and enrich the raw data. These jobs are written in &lt;strong&gt;PySpark&lt;/strong&gt;, allowing for scalable, distributed data processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Glue Job: Aggregating Fuel Prices&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The following Glue job aggregates weekly fuel prices into monthly averages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;month&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Spark session
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read raw fuel price data from S3
&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://insightflow-prod-raw-data/raw/fuelprice/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Aggregate weekly prices to monthly averages
&lt;/span&gt;&lt;span class="n"&gt;monthly_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ymd_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ymd_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \\
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ron95&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_ron95_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ron97&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_ron97_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diesel&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_diesel_price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write the transformed data back to S3
&lt;/span&gt;&lt;span class="n"&gt;monthly_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://insightflow-prod-processed-data/fuelprice_monthly/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Scheduling Glue Jobs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Glue jobs can be scheduled to run at regular intervals using &lt;strong&gt;Glue Triggers&lt;/strong&gt; or external orchestration tools like &lt;strong&gt;Kestra&lt;/strong&gt;.&lt;/p&gt;
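
&lt;p&gt;For the Glue Trigger option, here's a hedged boto3 sketch; the trigger and job names below are illustrative rather than the project's actual resources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

glue = boto3.client("glue")

# Create a scheduled trigger that runs a transformation job daily at 02:00 UTC.
glue.create_trigger(
    Name="insightflow-fuelprice-monthly-trigger",                # illustrative name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "insightflow-fuelprice-monthly-agg"}],  # illustrative job name
    StartOnCreation=True,
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;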




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Querying Transformed Data with Athena&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The transformed data is stored in a partitioned format in the &lt;code&gt;processed&lt;/code&gt; S3 bucket. Athena queries can leverage these partitions for efficient data retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example Query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s an example query to analyze the correlation between retail sales and fuel prices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron95_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_ron97_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avg_diesel_price&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="n"&gt;fct_retail_sales_monthly&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt;
    &lt;span class="n"&gt;fuelprice_monthly&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2025&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
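
&lt;p&gt;To run queries like this programmatically rather than in the console, here's a minimal boto3 Athena sketch (the results bucket is an assumed placeholder, not an actual project bucket):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT count(*) FROM fct_retail_sales_monthly WHERE year = 2025",
    QueryExecutionContext={"Database": "insightflow_prod"},
    ResultConfiguration={"OutputLocation": "s3://insightflow-prod-athena-results/"},  # assumed bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;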






&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges and Lessons Learned&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Schema Evolution&lt;/strong&gt;: Managing schema changes in Glue required careful configuration of the &lt;code&gt;schema_change_policy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt;: Proper partitioning of the transformed data significantly improved query performance in Athena.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Permissions&lt;/strong&gt;: Ensuring the Glue job role had the necessary permissions to access S3 and the Data Catalog was critical.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>etl</category>
      <category>aws</category>
    </item>
    <item>
      <title>InsightFlow Part 5: Designing the Data Model &amp; Schema with dbt for InsightFlow</title>
      <dc:creator>Pizofreude</dc:creator>
      <pubDate>Tue, 29 Apr 2025 02:30:49 +0000</pubDate>
      <link>https://dev.to/pizofreude/insightflow-part-5-designing-the-data-model-schema-with-dbt-for-insightflow-3fb6</link>
      <guid>https://dev.to/pizofreude/insightflow-part-5-designing-the-data-model-schema-with-dbt-for-insightflow-3fb6</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/pizofreude/insightflow-retail-economic-pipeline" rel="noopener noreferrer"&gt;InsightFlow GitHub Repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we’ll dive into how the &lt;strong&gt;data model and schema&lt;/strong&gt; for the InsightFlow project were designed using &lt;strong&gt;dbt (Data Build Tool)&lt;/strong&gt;. This layer is critical for transforming raw data into a structured, analysis-ready format that supports efficient querying and visualization. We’ll also explore the &lt;strong&gt;Entity-Relationship Diagram (ERD)&lt;/strong&gt; for the project, which provides a visual representation of the relationships between the key entities in the data model.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why dbt for Data Modeling?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;dbt is a powerful tool for transforming raw data into a structured format using SQL. It enables data engineers to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Standardize Transformations&lt;/strong&gt;: Define reusable SQL models for data cleaning, normalization, and enrichment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version Control&lt;/strong&gt;: Manage transformations as code in Git for collaboration and reproducibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test Data Quality&lt;/strong&gt;: Add tests to ensure data integrity at every stage of the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize for Querying&lt;/strong&gt;: Materialize models as views or tables, partitioned and optimized for querying in AWS Athena.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For InsightFlow, dbt was the ideal choice to transform raw retail and fuel price data into a &lt;strong&gt;star schema&lt;/strong&gt; that supports analysis of trends and correlations.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Overview of the Data Model&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The InsightFlow data model is designed as a &lt;strong&gt;star schema&lt;/strong&gt;, with the following key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fact Table&lt;/strong&gt;: Contains quantitative metrics, such as sales values and fuel prices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension Tables&lt;/strong&gt;: Provide descriptive attributes, such as MSIC group codes and dates, to slice and dice the data.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Tables&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact Table&lt;/strong&gt;: &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;Metrics: Sales values, volume indices, fuel prices.&lt;/li&gt;
&lt;li&gt;Partitioned by: Year and month for efficient querying in Athena.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Dimension Tables&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dim_msic_lookup&lt;/code&gt;: Provides descriptions for MSIC group codes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_date&lt;/code&gt;: A date dimension table for time-based analysis.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Defining Sources&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The raw data ingested into the &lt;strong&gt;landing zone&lt;/strong&gt; (S3 bucket) is defined as &lt;strong&gt;sources&lt;/strong&gt; in dbt. The underlying tables are created by the Glue Crawler and include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;iowrt&lt;/strong&gt;: Headline wholesale and retail trade data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iowrt_3d&lt;/strong&gt;: Detailed wholesale and retail trade data by MSIC group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fuelprice&lt;/strong&gt;: Weekly fuel price data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how the sources are defined in sources.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;

&lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;landing_zone&lt;/span&gt;
    &lt;span class="na"&gt;schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;insightflow_prod&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;loaded&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data.gov.my&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;via&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AWS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Batch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ingestion."&lt;/span&gt;
    &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;iowrt&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Headline&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Wholesale&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Retail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Trade&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(monthly)."&lt;/span&gt;
        &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;series&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Series&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;('abs',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'growth_yoy',&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'growth_mom')"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ymd_date&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Date&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;record&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(YYYY-MM-DD,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;monthly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;frequency)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sales&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sales&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;mil)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Volume&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Index&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(base&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2015&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;100)"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fuelprice&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;weekly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fuel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;data."&lt;/span&gt;
        &lt;span class="na"&gt;columns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ron95&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RON95&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM/litre)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ron97&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RON97&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM/litre)"&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;diesel&lt;/span&gt;
            &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Diesel&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Price&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(RM/litre)"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Creating Staging Models&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Staging models clean and standardize the raw data. For example, the stg_iowrt.sql model filters for absolute values and casts columns to appropriate data types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;source_data&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;select&lt;/span&gt;
        &lt;span class="n"&gt;series&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ymd_date&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;record_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;volume&lt;/span&gt;
    &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'landing_zone'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'iowrt'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
    &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;series&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abs'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;select&lt;/span&gt;
    &lt;span class="n"&gt;record_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sales_value_rm_mil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;volume&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;double&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;volume_index&lt;/span&gt;
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;source_data&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Building the Fact Table&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The fact table, &lt;code&gt;fct_retail_sales_monthly&lt;/code&gt;, combines data from multiple sources (e.g., retail sales and fuel prices) into a single table. It is partitioned by year and month for efficient querying in Athena.&lt;/p&gt;

&lt;p&gt;Here’s the configuration in dbt_project.yml:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;fct_retail_sales_monthly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;+materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table&lt;/span&gt;
  &lt;span class="na"&gt;+partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;year&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;month&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Adding Dimension Tables&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. MSIC Lookup Dimension&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;dim_msic_lookup&lt;/code&gt; table provides descriptions for MSIC group codes. It is created from a seed file (&lt;code&gt;msic_lookup.csv&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;seeds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;insightflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;msic_lookup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;+schema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;raw_seeds&lt;/span&gt;
      &lt;span class="na"&gt;+file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;parquet&lt;/span&gt;
      &lt;span class="na"&gt;+column_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;group_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;varchar&lt;/span&gt;
        &lt;span class="na"&gt;desc_en&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;varchar&lt;/span&gt;
        &lt;span class="na"&gt;desc_bm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;varchar&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Date Dimension&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;dim_date&lt;/code&gt; table is a standard date dimension table that supports time-based analysis.&lt;/p&gt;
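
&lt;p&gt;The project builds &lt;code&gt;dim_date&lt;/code&gt; as a dbt model; purely to illustrate the columns such a table typically carries, here's a small pandas sketch (the date range and column set are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# One row per day over an assumed range that covers the datasets in this project.
dates = pd.date_range("2015-01-01", "2025-12-31", freq="D")

dim_date = pd.DataFrame({
    "date_key": dates,
    "year": dates.year,
    "quarter": dates.quarter,
    "month": dates.month,
    "month_name": dates.month_name(),
    "day_of_week": dates.day_name(),
    "is_weekend": dates.dayofweek &gt;= 5,
})

print(dim_date.head())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;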




&lt;h2&gt;
  
  
  &lt;strong&gt;Step 5: Testing and Documentation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;dbt allows you to add tests to ensure data quality. For example, you can require that a column is never null and that the raw &lt;code&gt;series&lt;/code&gt; column only contains accepted values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;not_null&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;accepted_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;abs&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;growth_yoy&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;growth_mom&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additionally, dbt automatically generates documentation for your models (via &lt;code&gt;dbt docs generate&lt;/code&gt;), which can be served locally with &lt;code&gt;dbt docs serve&lt;/code&gt; and viewed in a browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Entity-Relationship Diagram (ERD)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the ERD for the InsightFlow data model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshti21z9mgo8jufo81hu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fshti21z9mgo8jufo81hu.png" alt="The Entity Relational Diagram (ERD) for InsightFlow" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also explore the diagram interactively on &lt;a href="https://dbdiagram.io/d/681038261ca52373f5bce4c1" rel="noopener noreferrer"&gt;dbdiagram.io&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;By leveraging dbt, we transformed raw data into a structured, analysis-ready format. The star schema design ensures efficient querying and supports a wide range of analyses, from sales trends to fuel price correlations.&lt;/p&gt;

</description>
      <category>dezoomcamp</category>
      <category>dataengineering</category>
      <category>dbt</category>
      <category>datastructures</category>
    </item>
  </channel>
</rss>
