<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bob Otieno Okech</title>
    <description>The latest articles on DEV Community by Bob Otieno Okech (@bobokech).</description>
    <link>https://dev.to/bobokech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2491157%2Fdde963c5-3600-4dd6-8fb7-515d018a2947.jpg</url>
      <title>DEV Community: Bob Otieno Okech</title>
      <link>https://dev.to/bobokech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bobokech"/>
    <language>en</language>
    <item>
      <title>The Role of Data Warehousing and Dimensional Modeling in Building a Scalable Data Warehouse for AI Agents</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Mon, 09 Jun 2025 14:37:38 +0000</pubDate>
      <link>https://dev.to/bobokech/the-role-of-data-warehousing-and-dimensional-modeling-in-building-a-scalable-data-warehouse-for-ai-2nek</link>
      <guid>https://dev.to/bobokech/the-role-of-data-warehousing-and-dimensional-modeling-in-building-a-scalable-data-warehouse-for-ai-2nek</guid>
      <description>&lt;p&gt;In the rush to develop cutting-edge AI agents, one critical skill often gets overlooked: &lt;strong&gt;data warehousing&lt;/strong&gt;. While AI algorithms and machine learning frameworks grab the spotlight, the backbone of any scalable, high-performing AI system lies in how data is structured, stored, and accessed.&lt;/p&gt;

&lt;p&gt;A well-designed data warehouse, built on the principles of &lt;strong&gt;dimensional modeling&lt;/strong&gt;, ensures that AI agents can efficiently process vast amounts of data, deliver real-time insights, and adapt to evolving business needs. This article explores why data warehousing and dimensional modeling are indispensable for AI scalability and how their components work together to power intelligent systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Data Warehousing Matters for AI Agents
&lt;/h2&gt;

&lt;p&gt;AI agents thrive on data. Whether they're generating insights, making predictions, or automating decisions, they rely on &lt;strong&gt;clean, consistent, and accessible data&lt;/strong&gt;. A data warehouse serves as the centralized repository that organizes raw data into a format optimized for analysis, enabling AI agents to perform complex queries and deliver actionable results.&lt;/p&gt;

&lt;p&gt;Without a robust data warehousing strategy, AI systems risk being bogged down by inconsistent data, slow query performance, and scalability bottlenecks.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Benefits of Data Warehousing for AI
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: A data warehouse can handle massive datasets, ensuring AI agents can scale to meet growing demands.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: By integrating data from multiple sources, a warehouse provides a single source of truth, critical for accurate AI predictions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Optimized for analytical queries, data warehouses enable AI agents to process data quickly, even for complex, ad-hoc requests.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical Context&lt;/strong&gt;: AI models often require historical data for training and trend analysis, which data warehouses store efficiently.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is a Data Warehouse?
&lt;/h2&gt;

&lt;p&gt;In the simplest terms, a &lt;strong&gt;data warehouse&lt;/strong&gt; is a central repository of information designed to enable and support &lt;strong&gt;business intelligence (BI)&lt;/strong&gt; activities, especially analytics.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Components of a Data Warehouse
&lt;/h2&gt;

&lt;p&gt;A data warehouse environment consists of four key components, each playing a critical role in supporting AI agents and other business stakeholders:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl09v6jcqqb4c3n0ib04.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl09v6jcqqb4c3n0ib04.png" alt="Datawarehouse Architecture" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Operational Source Systems
&lt;/h3&gt;

&lt;p&gt;Operational source systems capture the raw transactional data of a business, such as sales records, customer interactions, or inventory updates. These systems are optimized for &lt;strong&gt;transactional processing&lt;/strong&gt;, rather than analytical queries, and typically lack historical data or cross-system integration capabilities.&lt;/p&gt;

&lt;p&gt;These systems provide raw input data, but their &lt;strong&gt;stovepipe nature&lt;/strong&gt;—where data is siloed by application—poses challenges. A well-designed data warehouse extracts this data, transforming it into a format suitable for consumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data Staging Area
&lt;/h3&gt;

&lt;p&gt;The data staging area is the "kitchen" of the data warehouse, where raw data is cleaned, transformed, and prepared for analysis. This &lt;strong&gt;extract-transform-load (ETL)&lt;/strong&gt; process involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extraction&lt;/strong&gt;: Pulling data from operational systems.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformation&lt;/strong&gt;: Cleansing (e.g., fixing misspellings, resolving conflicts), combining data from multiple sources, deduplicating, and assigning standardized keys.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loading&lt;/strong&gt;: Delivering the transformed data to the presentation area.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The staging area ensures data &lt;strong&gt;quality and consistency&lt;/strong&gt;, which are critical for training reliable models. However, this area is off-limits to users and AI queries to maintain security and focus on processing efficiency.&lt;/p&gt;

&lt;p&gt;While some organizations use normalized structures in staging, &lt;strong&gt;dimensional modeling&lt;/strong&gt; in the presentation area is key for scalability.&lt;/p&gt;
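
&lt;p&gt;As a rough illustration of this extract-transform-load flow, here is a minimal Python sketch of a staging transformation using pandas. The source tables, column names, and cleaning rules are hypothetical and only meant to show the shape of the work the staging area does:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Minimal, illustrative ETL sketch for a staging area (requires pandas).
# Source data, column names, and cleaning rules are hypothetical.
import pandas as pd

# Extract: pull raw exports from two operational systems
crm = pd.DataFrame({"cust_id": [1, 2, 2], "name": [" alice ", "Bob", "Bob"]})
erp = pd.DataFrame({"cust_id": [1, 3], "name": ["Alice", "carol"]})

# Transform: standardize values, combine sources, deduplicate, assign surrogate keys
staged = pd.concat([crm, erp], ignore_index=True)
staged["name"] = staged["name"].str.strip().str.title()      # fix casing and whitespace
staged = staged.drop_duplicates(subset=["cust_id", "name"]).reset_index(drop=True)
staged["customer_key"] = staged.index + 1                    # standardized surrogate key

# Load: hand the cleaned data to the presentation area (here, just a local file)
staged.to_csv("dim_customer_staged.csv", index=False)
print(staged)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;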

&lt;h3&gt;
  
  
  3. Data Presentation Area
&lt;/h3&gt;

&lt;p&gt;The data presentation area is where data is organized into &lt;strong&gt;dimensional models&lt;/strong&gt; (star schemas or cubes) for querying by AI agents, analytical tools, and business users. This area is the &lt;strong&gt;heart of the data warehouse&lt;/strong&gt;, designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Understandability&lt;/strong&gt;: Dimensional models, with intuitive dimensions like product, market, and time, make it easy to navigate and process data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Performance&lt;/strong&gt;: Star schemas optimize complex queries, enabling real-time insights.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Data&lt;/strong&gt;: Storing granular, atomic data allows users and agents to answer precise, unpredictable questions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed Dimensions&lt;/strong&gt;: Using shared dimensions and facts across data marts ensures consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;data warehouse bus architecture&lt;/strong&gt;, with conformed dimensions and facts, enables scalable, distributed systems. This is critical for AI agents that need to combine data from multiple domains (e.g., sales, marketing, and supply chain) to generate holistic insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Data Access Tools
&lt;/h3&gt;

&lt;p&gt;Data access tools, ranging from ad hoc query tools to sophisticated AI-driven analytics, interact with the presentation area to deliver insights. For AI agents, these tools include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ad Hoc Query Tools&lt;/strong&gt;: Allow AI agents to explore data dynamically.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytic Applications&lt;/strong&gt;: Prebuilt templates for common AI tasks, such as forecasting or customer segmentation.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Mining and Modeling Tools&lt;/strong&gt;: Enable AI agents to build and refine predictive models.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging &lt;strong&gt;dimensional models&lt;/strong&gt; in the presentation area, these tools ensure that AI agents can access data efficiently, even for complex, iterative queries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dimensional Modeling: The Key to Scalability
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqeuzddcqq6eyc88cop3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqeuzddcqq6eyc88cop3.png" alt="Schemas" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dimensional modeling&lt;/strong&gt; is the cornerstone of a scalable data warehouse. Unlike normalized (3NF) models, which prioritize transactional efficiency and eliminate redundancy, dimensional models are designed for &lt;strong&gt;analytical simplicity and performance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's why dimensional modeling is critical for AI agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity&lt;/strong&gt;: Organizes data into intuitive structures (e.g., fact tables for metrics, dimension tables for context), making it easier for AI agents to process and interpret data.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt;: Star schemas reduce the complexity of joins, enabling faster query execution.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility&lt;/strong&gt;: Atomic data and conformed dimensions allow AI agents to handle unpredictable queries and adapt to changing business needs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding Complexity&lt;/strong&gt;: Normalized models, with their intricate web of tables, are impractical for AI queries, leading to slow performance and user frustration.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example: Dimensional Modeling in Action
&lt;/h2&gt;

&lt;p&gt;Consider a retail AI agent analyzing sales performance. A dimensional model might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact Table&lt;/strong&gt;: Sales transactions with metrics like revenue and quantity sold.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension Tables&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Product (e.g., SKU, category)
&lt;/li&gt;
&lt;li&gt;Market (e.g., region, store)
&lt;/li&gt;
&lt;li&gt;Time (e.g., date, quarter)
&lt;/li&gt;
&lt;li&gt;Customer (e.g., demographics, purchase history)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The AI agent can quickly slice and dice this data to answer questions like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What were the sales for eco-friendly products in urban stores last quarter?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The dimensional structure ensures fast, accurate results, even for ad-hoc queries.&lt;/p&gt;
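
&lt;p&gt;To make the slice-and-dice concrete, here is a small self-contained sketch of such a star schema in Python with SQLite. The table names, columns, and rows are invented for illustration; the point is that the question above maps to one straightforward join across the fact and dimension tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy star schema showing how the quoted question becomes a simple join.
# Table layout and sample rows are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, sku TEXT, category TEXT);
CREATE TABLE dim_market  (market_key  INTEGER PRIMARY KEY, store TEXT, region TEXT);
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, date TEXT, quarter TEXT);
CREATE TABLE fct_sales   (product_key INTEGER, market_key INTEGER, date_key INTEGER,
                          revenue REAL, quantity INTEGER);

INSERT INTO dim_product VALUES (1, 'ECO-001', 'eco-friendly'), (2, 'STD-001', 'standard');
INSERT INTO dim_market  VALUES (1, 'Downtown', 'urban'), (2, 'Hillside', 'rural');
INSERT INTO dim_date    VALUES (1, '2025-02-10', '2025-Q1'), (2, '2025-05-03', '2025-Q2');
INSERT INTO fct_sales   VALUES (1, 1, 1, 120.0, 4), (1, 2, 1, 80.0, 2), (2, 1, 2, 50.0, 1);
""")

# "What were the sales for eco-friendly products in urban stores last quarter?"
answer = conn.execute("""
    SELECT SUM(f.revenue) AS revenue, SUM(f.quantity) AS units
    FROM fct_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_market  m ON m.market_key  = f.market_key
    JOIN dim_date    d ON d.date_key    = f.date_key
    WHERE p.category = 'eco-friendly' AND m.region = 'urban' AND d.quarter = '2025-Q1'
""").fetchone()
print(answer)  # (120.0, 4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;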




&lt;h2&gt;
  
  
  Avoiding Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;Many data warehousing projects fail due to overemphasis on &lt;strong&gt;normalized structures&lt;/strong&gt; in the staging area or &lt;strong&gt;neglect of the presentation area&lt;/strong&gt;. These mistakes can be catastrophic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overly Complex Schemas&lt;/strong&gt;: Normalized models in the presentation area lead to slow queries and frustrated users. Dimensional models are non-negotiable for scalability.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Atomic Data&lt;/strong&gt;: Storing only aggregated data limits an AI agent's ability to drill down into granular details.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stovepipe Data Marts&lt;/strong&gt;: Without conformed dimensions, business users struggle to integrate data across business processes, leading to inconsistent insights.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Integrating Data Warehousing with AI Development
&lt;/h2&gt;

&lt;p&gt;To build scalable AI agents, data warehousing and dimensional modeling must be integrated into the development process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Design for Dimensional Modeling&lt;/strong&gt;: Prioritize star schemas in the presentation area.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invest in ETL Processes&lt;/strong&gt;: Ensure clean, consistent data for AI training and inference.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Conformed Dimensions&lt;/strong&gt;: Enable AI agents to combine data across domains seamlessly.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize for Scalability&lt;/strong&gt;: Ensure the warehouse can handle growing data volumes and query complexity.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balance Staging and Presentation&lt;/strong&gt;: Avoid over-investing in normalized staging at the expense of a robust presentation layer.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data warehousing and dimensional modeling are not just supporting acts—they are foundational&lt;/strong&gt; to building scalable, high-performing AI agents. By structuring data into intuitive, query-optimized dimensional models, organizations can empower AI agents to deliver real-time insights, adapt to changing needs, and scale effortlessly.&lt;/p&gt;

&lt;p&gt;As AI continues to transform industries, mastering data warehousing and dimensional modeling will be the hidden skill that sets successful AI projects apart.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>YouTube API Project</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Sun, 04 May 2025 06:53:31 +0000</pubDate>
      <link>https://dev.to/bobokech/youtube-api-project-214n</link>
      <guid>https://dev.to/bobokech/youtube-api-project-214n</guid>
      <description>&lt;h2&gt;
  
  
  📊 YouTube API – Data Warehouse &amp;amp; Analytics Solution
&lt;/h2&gt;

&lt;p&gt;This repository demonstrates a complete data pipeline that extracts data from the &lt;strong&gt;YouTube Data API&lt;/strong&gt;, models it using the &lt;strong&gt;Medallion Architecture&lt;/strong&gt;, and delivers business-ready insights via &lt;strong&gt;Grafana dashboards&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  📦 Project Summary
&lt;/h2&gt;

&lt;p&gt;This project implements a modern analytics pipeline with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medallion Architecture&lt;/strong&gt;: Structured into &lt;strong&gt;Bronze&lt;/strong&gt;, &lt;strong&gt;Silver&lt;/strong&gt;, and &lt;strong&gt;Gold&lt;/strong&gt; layers for scalable data processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL Workflows&lt;/strong&gt;: Automated extraction, transformation, and loading using &lt;strong&gt;Apache Airflow&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Modeling&lt;/strong&gt;: Dimensional modeling in PostgreSQL for optimized querying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt;: Real-time reporting using &lt;strong&gt;Grafana&lt;/strong&gt;, powered by SQL.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧰 Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; – Central data warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Airflow&lt;/strong&gt; – Workflow orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; – Real-time data visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Linux VM&lt;/strong&gt; – Compute environment for pipeline execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; – API ingestion &amp;amp; transformation logic&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Project Objectives
&lt;/h2&gt;

&lt;p&gt;Build a production-ready analytics solution to analyze YouTube channel and video performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source structured data from the YouTube Data API&lt;/li&gt;
&lt;li&gt;Clean, validate, and model for business intelligence&lt;/li&gt;
&lt;li&gt;Persist historical metrics (views, likes, etc.) for trend analysis&lt;/li&gt;
&lt;li&gt;Deliver actionable insights via dashboards and SQL queries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🗃️ Data Architecture (Medallion Model)
&lt;/h2&gt;

&lt;p&gt;This project follows a &lt;strong&gt;Bronze → Silver → Gold&lt;/strong&gt; pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7h14w1q0s19j7kd4bxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7h14w1q0s19j7kd4bxr.png" alt="Architecture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🔹 Bronze Layer
&lt;/h3&gt;

&lt;p&gt;Raw ingestion from the YouTube API (JSON format)&lt;/p&gt;
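
&lt;p&gt;As a sketch of what this layer can look like in Python: the snippet below calls the public YouTube Data API v3 and writes the response to disk untouched. The API key, channel ID, and output path are placeholders, not the project's actual code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative bronze-layer ingestion: fetch raw channel statistics as JSON
# and persist them unmodified. API key, channel ID, and path are placeholders.
import json
from pathlib import Path

import requests

API_KEY = "YOUR_API_KEY"                    # placeholder
CHANNEL_ID = "UC_x5XG1OV2P6uZZ5FSM9Ttw"     # placeholder channel ID

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/channels",
    params={"part": "snippet,statistics", "id": CHANNEL_ID, "key": API_KEY},
    timeout=30,
)
resp.raise_for_status()

# Bronze layer: store the payload exactly as received, no transformation yet
Path("bronze").mkdir(exist_ok=True)
with open("bronze/channel_overview_raw.json", "w") as f:
    json.dump(resp.json(), f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;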

&lt;h3&gt;
  
  
  🔸 Silver Layer
&lt;/h3&gt;

&lt;p&gt;Cleaned, validated, and structured data (see data flow and model below)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Flow&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5zcwddaic2eloxrbfgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm5zcwddaic2eloxrbfgi.png" alt="DataFlow" width="781" height="731"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Model&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cus0akmrdna6t7y7yqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cus0akmrdna6t7y7yqf.png" alt="Data Model" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟡 Gold Layer
&lt;/h3&gt;

&lt;p&gt;Aggregated data used to generate KPIs and dashboards in Grafana&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visualization Sample&lt;/strong&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e2goyx5dw6yvs6548p3.png" alt="Visualization" width="800" height="417"&gt;
&lt;/li&gt;
&lt;/ul&gt;
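
&lt;p&gt;To show how the three layers hang together in Airflow, here is a minimal orchestration sketch. The DAG ID, task IDs, and callables are illustrative placeholders rather than the project's real pipeline code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# dags/youtube_medallion_dag.py -- illustrative sketch, not the project's real DAG
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_bronze():
    print("pull raw JSON from the YouTube API into the bronze layer")

def transform_silver():
    print("clean, validate, and model bronze data into silver tables")

def aggregate_gold():
    print("build gold aggregates that feed the Grafana dashboards")

with DAG("youtube_medallion",
         schedule_interval="@daily",
         start_date=datetime(2025, 1, 1),
         catchup=False) as dag:

    bronze = PythonOperator(task_id="ingest_bronze", python_callable=ingest_bronze)
    silver = PythonOperator(task_id="transform_silver", python_callable=transform_silver)
    gold = PythonOperator(task_id="aggregate_gold", python_callable=aggregate_gold)

    bronze &gt;&gt; silver &gt;&gt; gold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;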




&lt;h2&gt;
  
  
  📈 BI Use Cases
&lt;/h2&gt;

&lt;p&gt;Dashboards and SQL queries answer key questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the &lt;strong&gt;top-performing videos&lt;/strong&gt; per channel?&lt;/li&gt;
&lt;li&gt;How is each &lt;strong&gt;channel performing over time&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What are the &lt;strong&gt;daily trends&lt;/strong&gt; for views and engagement?&lt;/li&gt;
&lt;/ul&gt;
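
&lt;p&gt;As an example of the first question, a query along these lines could run against the warehouse. The connection settings and the table and column names below are assumptions loosely based on the repository's DDL file names, not the verified schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative "top-performing videos per channel" query (requires psycopg2).
# Connection details and table/column names are assumptions, not the real schema.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="youtube_dw",
                        user="analytics", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT d.channel_title, d.video_title, MAX(f.view_count) AS views
        FROM fct_video_statistics f
        JOIN dim_videos d ON d.video_id = f.video_id
        GROUP BY d.channel_title, d.video_title
        ORDER BY views DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;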




&lt;h2&gt;
  
  
  📁 Repository Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── README.md
├── channel_lists.py
├── channel_overview.py
├── channel_videos.py
├── __pycache__/                        # Compiled Python files
├── project_files/
│   ├── Architecture/                  # Draw.io and PNG files for architecture
│   └── ddl_update_scripts/           # SQL DDLs and procedures
│       ├── dim_channels.sql
│       ├── dim_videos.sql
│       ├── fct_subscribers_views_video_count.sql
│       └── fct_video_statistics.sql
└── requirements.txt                   # Python dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔗 Access the Code
&lt;/h2&gt;

&lt;p&gt;Browse the full codebase &lt;a href="https://github.com/bobotieno1997/YoutubeAPI-Project/tree/main" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Airflow Xcoms</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Sun, 30 Mar 2025 08:53:56 +0000</pubDate>
      <link>https://dev.to/bobokech/airflow-xcoms-m40</link>
      <guid>https://dev.to/bobokech/airflow-xcoms-m40</guid>
      <description>&lt;h2&gt;
  
  
  What Are XComs in Airflow?
&lt;/h2&gt;

&lt;p&gt;XComs, short for &lt;strong&gt;Cross-Communications&lt;/strong&gt;, allow tasks in an Airflow DAG to share data with each other. They store small pieces of data (key-value pairs) in Airflow’s &lt;strong&gt;metadata database&lt;/strong&gt;, making it possible for one task to push data and another task to retrieve it later.  &lt;/p&gt;

&lt;p&gt;When using a &lt;strong&gt;PythonOperator&lt;/strong&gt;, Airflow automatically creates an XCom if the function returns a value. However, the way Airflow names and stores these values might not always be intuitive. To have more control over the XCom, you can use the &lt;strong&gt;task instance (&lt;code&gt;ti&lt;/code&gt;) object&lt;/strong&gt; and specify a custom key when pushing data.  &lt;/p&gt;

&lt;p&gt;Let's look at a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# dags/xcom_dag.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_training_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;s accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_choose_best_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choose best model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;xcom_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;downloading_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;downloading_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sleep 3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;training_model_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_model_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_training_model&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

    &lt;span class="n"&gt;choose_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choose_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_choose_best_model&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;downloading_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;training_model_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;choose_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the above code is executed in Airflow, navigating to the XComs view in the Airflow UI will show a newly generated XCom entry by default, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4s90lql2m1bt2jdvn91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4s90lql2m1bt2jdvn91.png" alt="Auto-Generated DAG" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, &lt;strong&gt;BashOperator&lt;/strong&gt; stores the last line of the command's output in XCom. However, if you don't need this behavior, you can set &lt;code&gt;do_xcom_push=False&lt;/code&gt; to prevent unnecessary data storage. This helps keep the Airflow metadata database clean and optimized.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using XComs for Data Sharing
&lt;/h3&gt;

&lt;p&gt;We can modify the code by explicitly pushing and pulling XComs using the task instance (&lt;code&gt;ti&lt;/code&gt;). This allows tasks to share data efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# dags/xcom_dag.py
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uniform&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2020&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_training_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s"&gt;s accuracy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_choose_best_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choose best model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;accuracies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ti&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model_accuracy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_model_A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_model_B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_model_C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accuracies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;xcom_dag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;downloading_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;downloading_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sleep 3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;do_xcom_push&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# Prevents unnecessary XCom storage
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;training_model_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;training_model_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_training_model&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;

    &lt;span class="n"&gt;choose_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choose_model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_choose_best_model&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;downloading_data&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;training_model_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;choose_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This DAG demonstrates how XComs facilitate data sharing between tasks in Airflow. It starts with a &lt;code&gt;BashOperator&lt;/code&gt; task (&lt;code&gt;downloading_data&lt;/code&gt;) that simulates data preparation. Next, three &lt;code&gt;PythonOperator&lt;/code&gt; tasks (&lt;code&gt;training_model_A&lt;/code&gt;, &lt;code&gt;training_model_B&lt;/code&gt;, and &lt;code&gt;training_model_C&lt;/code&gt;) generate random accuracy scores and push them to XCom using &lt;code&gt;ti.xcom_push(key='model_accuracy', value=accuracy)&lt;/code&gt;. Finally, the &lt;code&gt;choose_model&lt;/code&gt; task retrieves these values using &lt;code&gt;ti.xcom_pull(key='model_accuracy', task_ids=['training_model_A', 'training_model_B', 'training_model_C'])&lt;/code&gt; and prints them. This process enables tasks to exchange data without direct dependencies, with values stored in Airflow’s metadata database for later retrieval.&lt;/p&gt;

&lt;p&gt;As a proof of success, the screenshot below shows the result logs for the &lt;code&gt;choose_model&lt;/code&gt; task, where the accuracy scores are printed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2mh2hr1xztimr1f3eo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2mh2hr1xztimr1f3eo5.png" alt="Accuracy Score" width="800" height="199"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  XCom Limitations
&lt;/h3&gt;

&lt;p&gt;While XComs are powerful, they do have limitations. Since Airflow is an &lt;strong&gt;orchestration tool&lt;/strong&gt;, not a data processing tool, handling large volumes of data via XComs can be inefficient. The size limitations depend on the metadata database being used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MySQL&lt;/strong&gt;: ~64 KB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: ~1 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQLite&lt;/strong&gt;: ~2 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For larger data transfers, consider using external storage solutions like &lt;strong&gt;Amazon S3, Google Cloud Storage, or Azure Blob Storage&lt;/strong&gt;, and pass only references (e.g., file paths) via XComs.&lt;/p&gt;
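
&lt;p&gt;A common pattern, reusing the DAG structure shown above, is to have the producing task write its output to object storage and push only the path through XCom. The bucket and key below are placeholders; this is a sketch of the pattern, not code from a specific project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: store heavy data externally and exchange only a reference via XCom.
# The S3 path is a placeholder; plug these callables into a DAG like the one above.
def _extract(ti):
    s3_key = "s3://my-bucket/raw/2025-03-30/events.parquet"  # written by the real extract step
    ti.xcom_push(key="raw_data_path", value=s3_key)          # tiny string, cheap to store

def _transform(ti):
    s3_key = ti.xcom_pull(key="raw_data_path", task_ids="extract")
    print(f"reading {s3_key} from object storage for transformation")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;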




&lt;p&gt;With this understanding, you can now confidently use XComs in Airflow to facilitate data sharing between tasks while ensuring efficiency and best practices.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>An Introduction to AWS Virtual Private Cloud</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Sat, 29 Mar 2025 12:12:35 +0000</pubDate>
      <link>https://dev.to/bobokech/an-introduction-to-aws-virtual-private-cloud-1lid</link>
      <guid>https://dev.to/bobokech/an-introduction-to-aws-virtual-private-cloud-1lid</guid>
      <description>&lt;h2&gt;
  
  
  What is a Virtual Private Cloud (VPC)?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Virtual Private Cloud (VPC)&lt;/strong&gt; is an isolated network within a cloud environment that allows you to securely manage and control your resources. It provides a dedicated, logically separated space for applications, ensuring enhanced security, scalability, and efficient network traffic management.  &lt;/p&gt;

&lt;h2&gt;
  
  
  What are Subnets?
&lt;/h2&gt;

&lt;p&gt;Subnets are subdivisions of a VPC, each assigned a specific range of IP addresses. They help segment the network to improve organization, security, and performance by isolating workloads based on functionality, access requirements, or geographic location. Subnets can be &lt;strong&gt;public&lt;/strong&gt; (accessible from the internet) or &lt;strong&gt;private&lt;/strong&gt; (restricted to internal communication within the VPC).  &lt;/p&gt;

&lt;h2&gt;
  
  
  How to Set Up Your VPC
&lt;/h2&gt;

&lt;p&gt;To create a VPC, log in to the AWS Management Console and configure it as shown below:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu9giybdor3ydbe8twdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu9giybdor3ydbe8twdq.png" alt="VPC Configurations" width="800" height="623"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, we’ve named our VPC &lt;strong&gt;"demo-vpc"&lt;/strong&gt; and assigned it a &lt;strong&gt;CIDR range of 10.0.0.0/16&lt;/strong&gt;.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding CIDR (Classless Inter-Domain Routing)
&lt;/h3&gt;

&lt;p&gt;CIDR notation defines the IP address range for your VPC. In this case:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CIDR block &lt;strong&gt;10.0.0.0/16&lt;/strong&gt; means that the first two octets (10.0) represent the &lt;strong&gt;network ID&lt;/strong&gt;, which remains unchanged.
&lt;/li&gt;
&lt;li&gt;The last two octets (.0.0) represent the &lt;strong&gt;host ID&lt;/strong&gt;, which can be modified to allocate IP addresses within the VPC.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Common private IP ranges include:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10.0.0.0/8&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;172.16.0.0/12&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;192.168.0.0/16&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In AWS, CIDR notation follows the &lt;strong&gt;IP address/mask&lt;/strong&gt; format, where the mask determines the number of available host addresses in the network.  &lt;/p&gt;
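
&lt;p&gt;If you want to sanity-check a CIDR block before creating anything in AWS, Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module makes the arithmetic explicit. This is a small illustrative snippet, separate from the console steps in this guide:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Quick look at what a CIDR block implies, using only the standard library.
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)       # 65536 addresses in the VPC range
print(vpc.network_address)     # 10.0.0.0
print(vpc.broadcast_address)   # 10.0.255.255

# Carving the VPC into /24 subnets (256 addresses each, before AWS reserves 5 per subnet)
subnets = list(vpc.subnets(new_prefix=24))
print(subnets[0], subnets[1])  # 10.0.0.0/24 10.0.1.0/24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;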

&lt;h3&gt;
  
  
  Creating Public and Private Subnets
&lt;/h3&gt;

&lt;p&gt;Best practice is to create both &lt;strong&gt;public&lt;/strong&gt; and &lt;strong&gt;private&lt;/strong&gt; subnets in different &lt;strong&gt;Availability Zones&lt;/strong&gt; for high availability. However, for simplicity, we will create both in a single Availability Zone.  &lt;/p&gt;

&lt;h4&gt;
  
  
  Public Subnet Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdntt4r2kfbfumqese8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdntt4r2kfbfumqese8e.png" alt="Public Subnet Configuration" width="800" height="559"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Private Subnet Configuration
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fts47hu3lclakhi6jh6h4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fts47hu3lclakhi6jh6h4.png" alt="Private Subnet Configuration" width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To launch resources in your VPC, you must have at least one subnet configured. If you don't specify a subnet when launching an EC2 instance, AWS places it in a default subnet of your default VPC.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying EC2 Instances
&lt;/h2&gt;

&lt;p&gt;Now, let's create two EC2 instances: one in the &lt;strong&gt;public subnet&lt;/strong&gt; and the other in the &lt;strong&gt;private subnet&lt;/strong&gt;. Choose a free-tier eligible instance type and use the following configurations.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Public EC2 Instance Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrp2gxvuuw4xuhzn553d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrp2gxvuuw4xuhzn553d.png" alt="Public EC2 Configuration" width="800" height="759"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the &lt;strong&gt;demo-vpc&lt;/strong&gt; and assign the instance to the &lt;strong&gt;public subnet&lt;/strong&gt;. However, trying to connect to it will fail because the subnet is not yet configured for external access.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagr8l8tm2kkoekwqjvc5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fagr8l8tm2kkoekwqjvc5.png" alt="Public Subnet Error" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuring Internet Access
&lt;/h2&gt;

&lt;p&gt;To enable internet access, we need to configure:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Internet Gateway&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public Route Table&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Creating a Public Route Table
&lt;/h3&gt;

&lt;p&gt;A main route table is created by default when a VPC is created, but we need a separate one for routing external traffic.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrshbwk6glp1ejyf3gnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsrshbwk6glp1ejyf3gnh.png" alt="Public Route Table" width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating and Attaching an Internet Gateway
&lt;/h3&gt;

&lt;p&gt;AWS allows only &lt;strong&gt;one Internet Gateway per VPC&lt;/strong&gt;. Create and attach it to the &lt;strong&gt;demo-vpc&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0stiw0g38x3spq8uyrrv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0stiw0g38x3spq8uyrrv.png" alt="Creating an Internet Gateway" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating the Public Route Table
&lt;/h3&gt;

&lt;p&gt;Edit the &lt;strong&gt;public route table&lt;/strong&gt; to add the &lt;strong&gt;Internet Gateway&lt;/strong&gt; as a default route for external traffic (0.0.0.0/0).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrzfx1u96hacpb702t4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrzfx1u96hacpb702t4q.png" alt="Route Tables Adding VPC" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Assigning the Public Subnet to the Route Table
&lt;/h3&gt;

&lt;p&gt;Associate the &lt;strong&gt;public subnet&lt;/strong&gt; with the &lt;strong&gt;public route table&lt;/strong&gt; to finalize external access configuration.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbk39un1zc8zrs4aesdp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbk39un1zc8zrs4aesdp.png" alt="Assign Routing Table to Subnet" width="800" height="240"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the error disappears, and you can now SSH into your EC2 instance.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswtieq7osbsn1e0mjbue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswtieq7osbsn1e0mjbue.png" alt="Success Instance Connection" width="800" height="682"&gt;&lt;/a&gt;&lt;/p&gt;
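
&lt;p&gt;For readers who prefer scripting over the console, the same three steps (create and attach the gateway, add the default route, associate the subnet) can be expressed with boto3. This is a rough sketch with placeholder resource IDs, equivalent in spirit to the console walkthrough above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Rough boto3 equivalent of the console steps above. IDs are placeholders.
import boto3

ec2 = boto3.client("ec2")
VPC_ID = "vpc-0123456789abcdef0"               # placeholder
PUBLIC_SUBNET_ID = "subnet-0123456789abcdef0"  # placeholder

# Create the internet gateway and attach it to the VPC
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=VPC_ID)

# Create the public route table and add a default route through the gateway
rtb_id = ec2.create_route_table(VpcId=VPC_ID)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rtb_id,
                 DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw_id)

# Associate the public subnet with the public route table
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=PUBLIC_SUBNET_ID)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;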

&lt;h2&gt;
  
  
  Configuring a Private EC2 Instance
&lt;/h2&gt;

&lt;p&gt;Since the two instances are in the same VPC, you can connect from the &lt;strong&gt;public EC2 instance&lt;/strong&gt; to the &lt;strong&gt;private EC2 instance&lt;/strong&gt;. However, the private instance cannot access the internet for updates. To resolve this, we will:  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a Private Route Table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set Up a NAT Gateway in the Public Subnet&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Creating a NAT Gateway
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;NAT Gateway&lt;/strong&gt; allows the private instance to access the internet while preventing inbound connections.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5w7qb44rk9npj96vt13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc5w7qb44rk9npj96vt13.png" alt="NAT Gateway" width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Configuring the Private Route Table
&lt;/h3&gt;

&lt;p&gt;Update the private route table to route traffic through the &lt;strong&gt;NAT Gateway&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4zdxqjgm6krio490rkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4zdxqjgm6krio490rkf.png" alt="Private Route Table Update" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing Private Instance Internet Access
&lt;/h3&gt;

&lt;p&gt;Transfer the &lt;strong&gt;.pem&lt;/strong&gt; file to the public instance and SSH into the private instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scp &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"key_pair.pem"&lt;/span&gt; key_pair.pem ubuntu@&amp;lt;public-instance-ip&amp;gt;:/home/ubuntu/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once transferred, update file permissions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod &lt;/span&gt;400 key_pair.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, SSH into the private instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"key_pair.pem"&lt;/span&gt; ubuntu@&amp;lt;private-instance-ip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verifying Internet Connectivity
&lt;/h3&gt;

&lt;p&gt;Run the following command on the &lt;strong&gt;private instance&lt;/strong&gt; to check internet access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If successful, the output should confirm package updates.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs9gksljrq2xc6nmdei5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzs9gksljrq2xc6nmdei5.png" alt="Successful Update" width="749" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congratulations! 🎉 You have successfully set up a &lt;strong&gt;VPC&lt;/strong&gt; with &lt;strong&gt;public and private subnets&lt;/strong&gt;, deployed EC2 instances, configured &lt;strong&gt;internet access for public instances&lt;/strong&gt;, and enabled &lt;strong&gt;outbound access for private instances&lt;/strong&gt; using a &lt;strong&gt;NAT Gateway&lt;/strong&gt;. This architecture improves security while allowing necessary updates for private instances. &lt;/p&gt;

&lt;p&gt;With this foundation, you can now explore more AWS networking concepts such as &lt;strong&gt;VPC Peering, VPNs, and AWS Transit Gateway&lt;/strong&gt; to further optimize your cloud infrastructure. Happy cloud computing! ☁️🚀&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Understanding Apache Kafka: The Backbone of Real-Time Data Streaming</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Tue, 11 Mar 2025 11:14:40 +0000</pubDate>
      <link>https://dev.to/bobokech/understanding-apache-kafka-the-backbone-of-real-time-data-streaming-2k07</link>
      <guid>https://dev.to/bobokech/understanding-apache-kafka-the-backbone-of-real-time-data-streaming-2k07</guid>
      <description>&lt;p&gt;In today’s digital world, data is being generated at an unprecedented rate, especially in the realm of real-time streaming. From financial transactions and social media feeds to IoT devices and system logs, businesses rely on the continuous flow of data to drive decision-making, enhance user experiences, and improve operational efficiency. However, handling large-scale, high-velocity data streams requires a robust and scalable solution.&lt;/p&gt;

&lt;p&gt;Enter &lt;strong&gt;Apache Kafka&lt;/strong&gt;, an open-source event streaming platform designed to process and manage real-time data efficiently. In this guide, we’ll explore what Kafka is, how it works, and why it has become the go-to solution for real-time data streaming across industries. We’ll start by understanding the main components of Kafka and how they work together to ensure scalability, durability, maintainability, and fault tolerance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka’s Core Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxb17elwidizuozr3uwn2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxb17elwidizuozr3uwn2.jpg" alt="apache kafka architecture" width="720" height="376"&gt;&lt;/a&gt;&lt;br&gt;
The overall architecture of Apache Kafka revolves around a few key components. &lt;em&gt;Producers&lt;/em&gt;, the data sources, push message streams to &lt;em&gt;Kafka brokers&lt;/em&gt;—servers that act as intermediaries between producers and consumers. These messages are organized into &lt;em&gt;topics&lt;/em&gt;, unique identifiers for data streams, which can be split into &lt;em&gt;partitions&lt;/em&gt; to distribute large data volumes across multiple machines in a &lt;em&gt;cluster&lt;/em&gt;. A Kafka &lt;em&gt;cluster&lt;/em&gt;, a group of brokers working together, can range from a single-broker setup to a multi-broker configuration for enhanced scalability. &lt;/p&gt;

&lt;p&gt;Each message in a partition is tagged with an &lt;em&gt;offset&lt;/em&gt;, a sequential number unique to that partition, allowing &lt;em&gt;consumers&lt;/em&gt; to track and process data in order. Consumers, often grouped into &lt;em&gt;consumer groups&lt;/em&gt; to share workloads, subscribe to topics and pull messages from brokers, with offsets ensuring no message is processed twice within the same group. Coordinating this distributed system is &lt;em&gt;Zookeeper&lt;/em&gt;, which tracks cluster nodes, topics, partitions, and offsets, ensuring seamless operation.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Role of the Broker
&lt;/h3&gt;

&lt;p&gt;In Kafka, a &lt;em&gt;broker&lt;/em&gt; is simply a running server that acts as an intermediary between applications that depend on each other: it receives data from producers and delivers it to consumers with the right permissions. This architecture makes Kafka highly scalable, durable, and fault-tolerant, capable of handling real-time demands across industries.&lt;/p&gt;
&lt;h2&gt;
  
  
  Running Kafka in Your Environment
&lt;/h2&gt;

&lt;p&gt;To get started with Apache Kafka, running it alongside Zookeeper is recommended for compatibility. Kafka isn’t natively designed for Windows, so using WSL2 (Windows 10 or later) or Docker is advised.&lt;br&gt;
Here’s how to set it up on Windows with WSL2:&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1: Install WSL2
&lt;/h3&gt;

&lt;p&gt;WSL2 (Windows Subsystem for Linux 2) provides a Linux environment on Windows without the overhead of a traditional VM. Ensure you’re on Windows 10 version 2004 or higher (check with &lt;code&gt;winver&lt;/code&gt;). Run this command in an admin PowerShell or Command Prompt, then restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wsl &lt;span class="nt"&gt;--install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Follow prompts to set up a Linux distribution (e.g., Ubuntu) and create a user account. Refer to &lt;a href="https://docs.microsoft.com/en-us/windows/wsl/" rel="noopener noreferrer"&gt;Microsoft Docs&lt;/a&gt; if needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install Java
&lt;/h3&gt;

&lt;p&gt;Kafka requires Java 11 or 17.&lt;br&gt;
You can check which Java version is available in your console by running this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;--install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Java isn’t already on your system, you can install either version with a command like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;openjdk-17-jre-headless &lt;span class="c"&gt;#for java 17&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Java 11, install &lt;code&gt;openjdk-11-jre-headless&lt;/code&gt; instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Install Apache Kafka
&lt;/h3&gt;

&lt;p&gt;Download the latest stable version (e.g., 3.7.0 as of February 27, 2024) from the &lt;a href="https://kafka.apache.org/downloads" rel="noopener noreferrer"&gt;Kafka downloads page&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Run this command to download Kafka 3.7.0, replacing the placeholder with the download URL from the Kafka downloads page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget download_url_from_kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run this command to extract the downloaded archive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xzf&lt;/span&gt; downloaded_zipped_file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Start Zookeeper
&lt;/h3&gt;

&lt;p&gt;Zookeeper, bundled with Kafka, manages the cluster. From the Kafka root directory in a terminal, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin&lt;span class="se"&gt;\z&lt;/span&gt;ookeeper-server-start.sh config&lt;span class="se"&gt;\z&lt;/span&gt;ookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Start the Kafka Server
&lt;/h3&gt;

&lt;p&gt;In a new terminal from the Kafka root directory, launch the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin&lt;span class="se"&gt;\k&lt;/span&gt;afka-server-start.sh config&lt;span class="se"&gt;\s&lt;/span&gt;erver.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Create a Topic
&lt;/h3&gt;

&lt;p&gt;Create a topic named “MyFirstTopic” with this command in a new terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin&lt;span class="se"&gt;\k&lt;/span&gt;afka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--topic&lt;/span&gt; MyFirstTopic &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm with “Created topic MyFirstTopic.”&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Start a Producer
&lt;/h3&gt;

&lt;p&gt;Launch a producer to send messages to the topic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-producer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; MyFirstTopic &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 8: Start a Consumer
&lt;/h3&gt;

&lt;p&gt;In another terminal, start a consumer to read messages in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;bin/kafka-console-consumer.sh &lt;span class="nt"&gt;--topic&lt;/span&gt; MyFirstTopic &lt;span class="nt"&gt;--from-beginning&lt;/span&gt; &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, messages typed in the producer will appear in the consumer instantly, demonstrating Kafka’s real-time streaming in action.&lt;/p&gt;
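
&lt;p&gt;The console scripts are handy for quick experiments, but in an application you would normally use a client library instead. Here is a minimal sketch of the same producer and consumer flow using the &lt;code&gt;kafka-python&lt;/code&gt; package (one of several Kafka clients you could choose), assuming the broker from Step 5 is still running on localhost:9092:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install kafka-python
from kafka import KafkaConsumer, KafkaProducer

# Produce a message to the topic created in Step 6
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("MyFirstTopic", b"Hello, Kafka!")
producer.flush()  # block until the message has actually been delivered

# Consume from the beginning of the topic, like --from-beginning in Step 8
consumer = KafkaConsumer(
    "MyFirstTopic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds with no new messages
)
for message in consumer:
    print(message.value.decode("utf-8"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;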

</description>
    </item>
    <item>
      <title>Mastering SQL for Data Engineering: Advanced Queries, Optimization, and Data Modeling Best Practices</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Mon, 10 Feb 2025 11:23:53 +0000</pubDate>
      <link>https://dev.to/bobokech/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modeling-best-practices-35hl</link>
      <guid>https://dev.to/bobokech/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modeling-best-practices-35hl</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;SQL (Structured Query Language) is fundamental to data engineering, serving as the backbone for managing, transforming, and analyzing data efficiently. Data engineers rely on SQL to build robust data pipelines, extract and process large datasets, and optimize query performance for analytical workloads. Mastering SQL is essential for handling real-world ETL (Extract, Transform, Load) processes, ensuring data integrity, and enabling efficient reporting and analytics.  &lt;/p&gt;

&lt;p&gt;This guide covers core SQL concepts, advanced techniques, performance optimization strategies, and data modeling best practices, equipping data engineers with the knowledge to handle complex data challenges.  &lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Core SQL Concepts for Data Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Essential SQL Commands&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data engineers frequently use the following SQL operations in data pipelines and ETL processes:  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;a. SELECT&lt;/strong&gt; – Retrieving data from tables
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;b. WHERE&lt;/strong&gt; – Filtering records based on conditions
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Different Types of JOINs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SQL provides several types of joins, each serving a different purpose in combining data from multiple tables.  &lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;a. INNER JOIN (Default JOIN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Retrieves only matching rows from both tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;b. LEFT JOIN (LEFT OUTER JOIN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Retrieves all rows from the left table and only matching rows from the right table. If no match is found, NULL values are returned for columns from the right table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;c. RIGHT JOIN (RIGHT OUTER JOIN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Retrieves all rows from the right table and only matching rows from the left table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;d. FULL JOIN (FULL OUTER JOIN)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Retrieves all rows from both tables, filling NULLs where there is no match.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;e. CROSS JOIN&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Returns the Cartesian product of both tables, meaning every row from the first table is combined with every row from the second table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Advanced SQL Techniques&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Recursive Queries and Common Table Expressions (CTEs)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;CTEs simplify complex queries and improve readability. Recursive CTEs help navigate hierarchical data such as organizational structures.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;EmployeeHierarchy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Window Functions for Advanced Analytics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Window functions enable running totals, rankings, and moving averages without affecting row-level granularity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Complex JOINs and Subqueries for Efficient Data Retrieval&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;Query Optimization and Performance Tuning&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Understanding Execution Plans and Query Profiling&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;EXPLAIN&lt;/code&gt; or &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; to examine query execution plans.
&lt;/li&gt;
&lt;li&gt;Identify bottlenecks such as full table scans and inefficient joins.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Indexing Strategies for Speed Optimization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Indexes significantly improve query performance by reducing the number of scanned rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_orders_total&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;B-tree indexes&lt;/strong&gt; for range queries.
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;Hash indexes&lt;/strong&gt; for exact lookups.
&lt;/li&gt;
&lt;li&gt;Avoid over-indexing, which can slow down &lt;code&gt;INSERT&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; operations.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Techniques for Reducing Query Complexity&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Avoid SELECT *&lt;/strong&gt; to minimize unnecessary data retrieval.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize JOINs&lt;/strong&gt; by ensuring indexed columns are used.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Denormalize data&lt;/strong&gt; selectively for read-heavy workloads.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Data Modeling Best Practices&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Normalization vs. Denormalization&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Normalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When ensuring data consistency and reducing redundancy is a priority.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Denormalization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When optimizing for read-heavy queries in analytics.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Designing Efficient Relational Schemas&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ensure primary and foreign keys are properly defined.
&lt;/li&gt;
&lt;li&gt;Use appropriate data types to optimize storage.
&lt;/li&gt;
&lt;li&gt;Partition large tables for better query performance.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Star Schema vs. Snowflake Schema for Analytical Queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star Schema&lt;/strong&gt;: Fewer joins, better performance for OLAP queries.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake Schema&lt;/strong&gt;: Reduces data redundancy but increases query complexity.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Real-World Application &amp;amp; Case Study&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Optimizing a Slow SQL Query&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: A report query on a large &lt;code&gt;sales&lt;/code&gt; table takes 30 seconds to execute.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimization Steps:&lt;/strong&gt;  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; to inspect the execution plan.
&lt;/li&gt;
&lt;li&gt;Add indexes to filter columns (&lt;code&gt;date&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;).
&lt;/li&gt;
&lt;li&gt;Replace a subquery with a JOIN to reduce repeated calculations.
&lt;/li&gt;
&lt;li&gt;Use materialized views for pre-aggregated data.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Optimized Query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_sales_date&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-12-31'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Mastering SQL is crucial for data engineers to build efficient data pipelines, optimize query performance, and design scalable data models. By leveraging advanced SQL techniques, indexing strategies, and best practices in data modeling, engineers can significantly improve data processing efficiency and analytics. Applying these concepts in real-world scenarios ensures data is handled optimally for business intelligence and decision-making.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Practice SQL queries on large datasets.
&lt;/li&gt;
&lt;li&gt;Experiment with indexing and query profiling.
&lt;/li&gt;
&lt;li&gt;Implement data modeling techniques in real-world projects.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By continuously refining SQL skills, data engineers can optimize performance and make data-driven processes more efficient. 🚀&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>bigdata</category>
      <category>aws</category>
      <category>azure</category>
    </item>
    <item>
      <title>A Comprehensive Guide to Setting Up a Data Engineering Project Environment.</title>
      <dc:creator>Bob Otieno Okech</dc:creator>
      <pubDate>Mon, 27 Jan 2025 13:52:27 +0000</pubDate>
      <link>https://dev.to/bobokech/a-comprehensive-guide-to-setting-up-a-data-engineering-project-environment-4e1d</link>
      <guid>https://dev.to/bobokech/a-comprehensive-guide-to-setting-up-a-data-engineering-project-environment-4e1d</guid>
      <description>&lt;p&gt;In today’s fast-paced business world, data is generated every second. This data holds the potential to provide valuable insights and drive decisions, yet much of it goes unused. Managing data from multiple sources manually can be time-consuming and error-prone. That’s why organizations need an efficient system to gather data from various sources, transform it to align with business needs, and store it in a central location for analysis. This is the role of a data platform.&lt;/p&gt;

&lt;p&gt;A data platform serves as a unified hub for collecting, processing, and storing an organization’s data. It ensures that business users can access accurate, consistent, and actionable data to inform their strategies and decisions.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll walk step by step through building your first data engineering environment using a selection of powerful tools and services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and services needed
&lt;/h2&gt;

&lt;p&gt;To get started, you’ll need the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS S3 Bucket – For storing raw and processed data.&lt;/li&gt;
&lt;li&gt;PostgreSQL Database – For structured data storage.&lt;/li&gt;
&lt;li&gt;DBeaver – A database management tool for querying and managing databases.&lt;/li&gt;
&lt;li&gt;Python 3 – For automation, data transformation, and integration tasks.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting Up Your First S3 Bucket on AWS
&lt;/h2&gt;

&lt;p&gt;To create an S3 bucket, you’ll need an authenticated AWS account. If you don’t have one, start by signing up &lt;a href="https://www.googleadservices.com/pagead/aclk?sa=L&amp;amp;ai=DChcSEwjDlLz11pOLAxXIlFAGHVUaBqAYABAAGgJkZw&amp;amp;ae=2&amp;amp;co=1&amp;amp;gclid=Cj0KCQiA19e8BhCVARIsALpFMgE88T-CbsQhIPMFTuGXpZwzOOEgQ6Bi-XfAbqhIz8lPcKH2aUtiivkaAnPfEALw_wcB&amp;amp;ohost=www.google.com&amp;amp;cid=CAESVuD2DUVKyoVwm3gjSC8ZWn0YGO_3O3Fe_VVy22K5hnaqfytWBssS9AYju5MCsW5X7qpq2dmzCCHyhjiGllKGfZWpTapkpgOXfxsZ01YIjPKL9EZTIvPA&amp;amp;sig=AOD64_1JMQPVbJTKY9Mfbp3sMcemHUpYaA&amp;amp;q&amp;amp;adurl&amp;amp;ved=2ahUKEwirorb11pOLAxWHQ0EAHYDkCFkQ0Qx6BAgKEAE" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Access the S3 Service
&lt;/h3&gt;

&lt;p&gt;Once logged in, use the AWS search bar to search for S3. Select the service from the search results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdryfsnofbcxp1pm1vx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdryfsnofbcxp1pm1vx7.png" alt="Search for s3" width="800" height="213"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create a Bucket
&lt;/h3&gt;

&lt;p&gt;You’ll be taken to the S3 dashboard. Click the Create Bucket button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0164bq110dfzll15tlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0164bq110dfzll15tlt.png" alt="Create a bucket" width="550" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Name Your Bucket
&lt;/h3&gt;

&lt;p&gt;Provide a unique name for your bucket, keeping AWS naming conventions in mind. Once you've entered the required details, click Create Bucket.&lt;/p&gt;

&lt;p&gt;That’s it! Your S3 bucket is now ready to store files. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Upload any file
&lt;/h3&gt;

&lt;p&gt;Navigate to the upload section and drag and drop a CSV file. Once the data has been uploaded, the upload status will read Succeeded, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewy21gf0u2p2f4ergsv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewy21gf0u2p2f4ergsv2.png" alt="s3 bucket uploads" width="800" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connect Python with S3
&lt;/h2&gt;

&lt;p&gt;To connect Python to your S3 bucket, you'll need the Boto3 library.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install boto3 on your PC
&lt;/h3&gt;

&lt;p&gt;Run this command in your terminal to install boto3 if it isn’t already installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll then go to IAM in your AWS console and create a new user to access the bucket. Search for IAM in the search bar, navigate to the IAM management console, and click Users in the navigation pane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Create an IAM user to access the S3 bucket
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd52jy5rodr7xgdfjz7v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd52jy5rodr7xgdfjz7v6.png" alt="IAM mgt console" width="800" height="133"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When setting up the user's permissions, ensure you assign the AmazonS3FullAccess policy as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa8spp3scsh3l5z9d09i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foa8spp3scsh3l5z9d09i.png" alt="Assign permission to the console" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click Create New User, fill in the required information, and generate access keys and secret keys.&lt;/p&gt;

&lt;p&gt;Within your Python environment, run the following code:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Connect to S3 using Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import boto3 library
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="c1"&gt;# Create an S3 resource using the connection parameters
&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;connection_params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate a list of all buckets available in our s3_resources bucket
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s3_resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the service we are accessing is S3, the &lt;code&gt;service_name&lt;/code&gt; parameter should be &lt;code&gt;s3&lt;/code&gt;. Fill in your console's region name, and use the access key and secret key generated for the IAM user to authenticate.&lt;/p&gt;

&lt;p&gt;After executing the code above, you'll see the list of all buckets we have in S3, as shown below.&lt;br&gt;
&lt;strong&gt;List of buckets using AWS UI&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkczyg9om8fe2hgsnge4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkczyg9om8fe2hgsnge4r.png" alt="List of buckets using AWS UI" width="712" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95ci497hvxbkvc9ypc6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95ci497hvxbkvc9ypc6e.png" alt="Python" width="703" height="269"&gt;&lt;/a&gt;&lt;br&gt;
    &lt;em&gt;List of buckets using python3&lt;/em&gt;&lt;/p&gt;
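
&lt;p&gt;Beyond listing buckets, the same resource object can upload files programmatically, for example the kind of CSV we uploaded earlier through the console. Here is a minimal sketch, where the bucket name and file names are placeholders for your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reuse the s3 resource created above to upload a local CSV into the bucket
s3.Bucket("your-bucket-name").upload_file(
    Filename="students.csv",   # local file to upload (placeholder)
    Key="raw/students.csv",    # object key (path) inside the bucket
)

# Confirm the upload by listing the objects now in the bucket
for obj in s3.Bucket("your-bucket-name").objects.all():
    print(obj.key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;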

&lt;h3&gt;
  
  
  Getting Postgres database engine
&lt;/h3&gt;

&lt;p&gt;You can set up your Postgres instance locally, but we'll use a cloud provider, Aiven.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create an account on Aiven and the database
&lt;/h3&gt;

&lt;p&gt;Create an account on aiven.com and log in.&lt;br&gt;
Navigate to create a new project and enter the details.&lt;br&gt;
Click on the new project and create a service. A list of services will be shown as below. Select Postgres, ensure you have selected the free plan,&lt;br&gt;
then create the service.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo2mokx8mo9g3v5aasgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmo2mokx8mo9g3v5aasgy.png" alt="List of services available in Aiven" width="800" height="454"&gt;&lt;/a&gt;&lt;br&gt;
Once the service is built, you'll be able to view it in the project, as shown below. Mine is mypostgres-001 and its status is Running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs8wnmrk22q85u8ioby0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffs8wnmrk22q85u8ioby0.png" alt="DB status in aiven" width="650" height="105"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up DBeaver on your PC
&lt;/h2&gt;

&lt;p&gt;DBeaver is a database management tool for querying and managing databases.&lt;br&gt;
We'll use DBeaver to manage our newly created Postgres database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Download and Install DBeaver
&lt;/h3&gt;

&lt;p&gt;Navigate to &lt;a href="https://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=&amp;amp;cad=rja&amp;amp;uact=8&amp;amp;ved=2ahUKEwilx76X0pWLAxWoW0EAHRGvOMsQFnoECAkQAQ&amp;amp;url=https%3A%2F%2Fdbeaver.io%2Fdownload%2F&amp;amp;usg=AOvVaw0sviKTtVTU6CUV6x0KSHLo&amp;amp;opi=89978449" rel="noopener noreferrer"&gt;Dbeaver&lt;/a&gt; to download and install Dbeaver.&lt;/p&gt;

&lt;p&gt;Once you have DBeaver installed, you can connect it to the database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Connect DBeaver with your Postgres database
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F497nivgxsq9xuivuu573.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F497nivgxsq9xuivuu573.png" alt="Connection Icon in DBeaver" width="786" height="257"&gt;&lt;/a&gt;&lt;br&gt;
Navigate to the connect icon and add a connection to our Postgres database.&lt;/p&gt;

&lt;p&gt;Under the Connection settings, ensure the details match the database credentials Aiven provided.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm52e6tw4vr3j0lknmgfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm52e6tw4vr3j0lknmgfu.png" alt="Dbeaver connection parameters" width="800" height="235"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk04jy40cdug16we9efxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk04jy40cdug16we9efxt.png" alt="Aiven connection parameters" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Test your connection and confirm it succeeds.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2am9lus5cb85ki5xjrs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2am9lus5cb85ki5xjrs.png" alt="Connection confirmation" width="432" height="245"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this point, you can create databases, schemas, and tables to store your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Run the code below to confirm you can create and manipulate objects in the database.
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a database called luxdev_test&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;luxdev_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Connect to the newly created database&lt;/span&gt;
&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="n"&gt;luxdev_test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a table to store the data&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;students&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;position&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Insert a few records into the table&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;students&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;position&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Peter'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Mercy'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The snapshot below shows the output generated when executing the code, confirming the data has successfully been added to the students table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa88thfpun95m777ga14u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa88thfpun95m777ga14u.png" alt="Output generated by Dbeaver" width="556" height="382"&gt;&lt;/a&gt;&lt;/p&gt;
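
&lt;p&gt;You can also query the new table straight from Python, which is how a pipeline would typically read or load data. Here is a minimal sketch using the &lt;code&gt;psycopg2&lt;/code&gt; driver (any Postgres driver would work); the host, port, user, and password are placeholders you'd copy from the Aiven service page:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# pip install psycopg2-binary
import psycopg2

# Connection details copied from the Aiven service page (placeholders shown)
conn = psycopg2.connect(
    host="your-service.aivencloud.com",
    port=12345,
    dbname="luxdev_test",
    user="avnadmin",
    password="YOUR_PASSWORD",
    sslmode="require",  # Aiven connections typically require SSL
)

with conn.cursor() as cur:
    cur.execute("SELECT id, name, position FROM raw.students;")
    for row in cur.fetchall():
        print(row)

conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;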

&lt;h2&gt;
  
  
  Wrapping it Up
&lt;/h2&gt;

&lt;p&gt;Congratulations! You’ve successfully walked through the foundational steps of setting up a data engineering project environment. By combining the power of AWS S3 for data storage, Python for automation, PostgreSQL for structured data management, and DBeaver for database management, you’ve built a scalable foundation for handling and analyzing data efficiently.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
