<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: John Wakaba</title>
    <description>The latest articles on DEV Community by John Wakaba (@john_analytics).</description>
    <link>https://dev.to/john_analytics</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1752671%2F8a1f42a0-3708-4440-a5e4-40fa766ff763.png</url>
      <title>DEV Community: John Wakaba</title>
      <link>https://dev.to/john_analytics</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/john_analytics"/>
    <language>en</language>
    <item>
      <title>ETL vs ELT: Which One Should You Use and Why?</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Fri, 10 Apr 2026 06:17:56 +0000</pubDate>
      <link>https://dev.to/john_analytics/etl-vs-elt-which-one-should-you-use-and-why-2247</link>
      <guid>https://dev.to/john_analytics/etl-vs-elt-which-one-should-you-use-and-why-2247</guid>
      <description>&lt;p&gt;&lt;em&gt;A Beginner's Guide to Data Pipeline Architecture&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you have ever worked with data or heard engineers talk about data pipelines — you have probably come across the terms ETL and ELT. They sound almost identical, but they represent two different philosophies for moving and processing data. Understanding the difference between them can help you make better architectural decisions for your projects or simply help you follow technical conversations with more confidence.&lt;/p&gt;

&lt;p&gt;This article breaks down both approaches, explains where each one shines, and helps you figure out which one might be the right choice for your situation.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What is ETL?
&lt;/h2&gt;

&lt;p&gt;ETL stands for &lt;strong&gt;Extract, Transform, Load&lt;/strong&gt;. It is a three-step process used to move data from one place to another  usually from various source systems into a central data warehouse.&lt;/p&gt;

&lt;p&gt;Think of it like a water treatment plant. Water (data) is collected from rivers (source systems), cleaned and purified (transformed), and then distributed to homes (loaded into a warehouse). The treatment happens before the water reaches your tap.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Steps of ETL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Extract
&lt;/h3&gt;

&lt;p&gt;Data is pulled from one or more source systems. These sources could be relational databases (like MySQL or PostgreSQL), spreadsheets, APIs, log files, CRM systems like Salesforce, or even flat files on a server.&lt;/p&gt;

&lt;p&gt;Example: A retail company extracts daily sales records from its point-of-sale (POS) system, customer data from its CRM, and inventory data from its warehouse management system.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2 — Transform
&lt;/h3&gt;

&lt;p&gt;This is the most complex step. The extracted raw data is processed and reshaped in a separate staging environment (called the ETL engine or transformation layer) before it ever enters the destination.&lt;/p&gt;

&lt;p&gt;Transformations can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaning data (removing duplicates, fixing null values)&lt;/li&gt;
&lt;li&gt;Standardising formats (converting dates from DD/MM/YYYY to YYYY-MM-DD)&lt;/li&gt;
&lt;li&gt;Enriching data (adding new computed columns, e.g. calculating customer age from a birth date)&lt;/li&gt;
&lt;li&gt;Joining data from multiple sources into a single, consistent structure&lt;/li&gt;
&lt;li&gt;Applying business rules (e.g. marking orders over $10,000 as high-value)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: The sales data is cleaned to remove duplicate transaction IDs, dates are normalised to UTC, and customer names are standardised to title case.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3 — Load
&lt;/h3&gt;

&lt;p&gt;The now clean, structured data is loaded into the destination typically a data warehouse like Microsoft SQL Server, Oracle. Because the data was already transformed, it arrives ready to query.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ETL in One Sentence:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
"Extract the data, clean and reshape it on a separate server, then load only the polished result into your warehouse."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Use Cases and Strengths of ETL
&lt;/h2&gt;

&lt;p&gt;ETL is well suited for scenarios where data sources are smaller in scale but transformations are complex, where there is a need to offload transformation processing away from the target system, and where data security is a priority requiring sensitive data to be masked or encrypted before it ever reaches a warehouse. ETL is an excellent choice when data consistency, quality, and compliance are non-negotiable.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Core Strength of ETL:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
ETL processes data before it reaches the warehouse, reducing the risk of sensitive data exposure and ensuring that all data conforms to business rules and standards from the moment it lands.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Python as an ETL Tool
&lt;/h2&gt;

&lt;p&gt;Python has become a go-to language for building ETL pipelines. Its rich ecosystem of libraries and frameworks makes every step of the ETL process extract, transform, and load more accessible and flexible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Python Libraries for ETL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pandas
&lt;/h3&gt;

&lt;p&gt;Pandas is the workhorse of data manipulation in Python. Its DataFrame structure makes it easy to load raw data, clean it, filter rows, rename columns, and reshape datasets. For small to medium sized ETL jobs, Pandas alone can handle the entire transformation step.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQLAlchemy
&lt;/h3&gt;

&lt;p&gt;SQLAlchemy provides a consistent and database agnostic way to interact with relational databases. It is especially useful in the Extract phase (reading from MySQL, PostgreSQL, SQL Server) and the Load phase (writing results back into a target database).&lt;/p&gt;

&lt;h3&gt;
  
  
  PySpark
&lt;/h3&gt;

&lt;p&gt;When your data volumes outgrow what a single machine can handle, PySpark steps in. It offers distributed data processing across a cluster of machines, making it suitable for large scale ETL tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Luigi and Apache Airflow
&lt;/h3&gt;

&lt;p&gt;ETL pipelines are rarely one off scripts. Luigi and Apache Airflow help orchestrate and schedule ETL pipelines. Airflow has become the industry standard for managing complex multi-step workflows.&lt;/p&gt;




&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;What It Means in Practice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Python libraries allow fully custom ETL processes tailored to business needs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;PySpark enables processing of massive datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Community Support&lt;/td&gt;
&lt;td&gt;Large ecosystem of tutorials and libraries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecosystem Integration&lt;/td&gt;
&lt;td&gt;Works well with cloud, APIs, and databases&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. What is ELT?
&lt;/h2&gt;

&lt;p&gt;ELT stands for &lt;strong&gt;Extract, Load, Transform&lt;/strong&gt;. Notice the difference: the &lt;strong&gt;T (Transform)&lt;/strong&gt; and &lt;strong&gt;L (Load)&lt;/strong&gt; have swapped positions.&lt;/p&gt;

&lt;p&gt;Instead of transforming data before loading it, ELT loads the raw data first and then transforms it inside the target system usually a modern cloud data warehouse.&lt;/p&gt;

&lt;p&gt;Using the water analogy again: instead of treating water before distribution, you pipe all the raw water directly into a large, powerful filtration tank at the destination.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Steps of ELT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1 — Extract
&lt;/h3&gt;

&lt;p&gt;Same as ETL — data is pulled from various source systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — Load
&lt;/h3&gt;

&lt;p&gt;Raw data is loaded directly into the target system without transformation.&lt;/p&gt;

&lt;p&gt;Example: Raw transaction records are loaded into a Snowflake table called &lt;code&gt;raw_transactions&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Transform
&lt;/h3&gt;

&lt;p&gt;Transformations are applied inside the warehouse using SQL or tools like dbt.&lt;/p&gt;

&lt;p&gt;Example: A dbt model queries raw_transactions and creates a clean table called &lt;code&gt;fact_sales&lt;/code&gt;.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ELT in One Sentence:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
"Extract the data, load all of it into your powerful cloud warehouse first, then transform it there."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why ELT Has Become So Popular
&lt;/h2&gt;

&lt;p&gt;ELT’s rise is tied to cloud warehouses like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These systems provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;elastic compute power&lt;/li&gt;
&lt;li&gt;columnar storage&lt;/li&gt;
&lt;li&gt;massively parallel processing (MPP)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Advantages of ELT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Flexibility
&lt;/h3&gt;

&lt;p&gt;Raw data is stored first, allowing transformation logic to change later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Efficiency at Scale
&lt;/h3&gt;

&lt;p&gt;Parallel processing makes ELT faster for large datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Suitability for Large Datasets
&lt;/h3&gt;

&lt;p&gt;ELT scales horizontally as data volumes grow.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Key Differences Between ETL and ELT
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;ETL&lt;/th&gt;
&lt;th&gt;ELT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transform Location&lt;/td&gt;
&lt;td&gt;Outside the warehouse&lt;/td&gt;
&lt;td&gt;Inside the warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best For&lt;/td&gt;
&lt;td&gt;Structured data&lt;/td&gt;
&lt;td&gt;Big data analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scalability&lt;/td&gt;
&lt;td&gt;Limited by server&lt;/td&gt;
&lt;td&gt;Cloud scalable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Schema defined early&lt;/td&gt;
&lt;td&gt;Schema flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Slower load&lt;/td&gt;
&lt;td&gt;Faster load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Data filtered before load&lt;/td&gt;
&lt;td&gt;Raw data stored first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Popular Tools&lt;/td&gt;
&lt;td&gt;Talend, Informatica&lt;/td&gt;
&lt;td&gt;dbt, Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Understanding the Most Important Differences
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Where Does Transformation Happen?
&lt;/h2&gt;

&lt;p&gt;ETL transforms data before loading.&lt;br&gt;&lt;br&gt;
ELT transforms data after loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Raw Data Preservation
&lt;/h2&gt;

&lt;p&gt;ELT keeps original raw data available for reprocessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalability
&lt;/h2&gt;

&lt;p&gt;ELT scales automatically with cloud warehouses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed and Data Ingestion
&lt;/h2&gt;

&lt;p&gt;ELT often loads data faster because transformation happens later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Control and Data Exposure
&lt;/h2&gt;

&lt;p&gt;ETL offers more control over what enters the warehouse.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Real-World Use Cases
&lt;/h2&gt;

&lt;h2&gt;
  
  
  When ETL Makes Sense
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Banking and Financial Reporting
&lt;/h3&gt;

&lt;p&gt;Strict validation rules required.&lt;/p&gt;

&lt;p&gt;Tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Informatica PowerCenter&lt;/li&gt;
&lt;li&gt;IBM DataStage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Healthcare Data Integration
&lt;/h3&gt;

&lt;p&gt;Standardised clinical data formats required.&lt;/p&gt;

&lt;p&gt;Tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Talend&lt;/li&gt;
&lt;li&gt;Microsoft SSIS&lt;/li&gt;
&lt;li&gt;Apache NiFi&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Legacy System Migration
&lt;/h3&gt;

&lt;p&gt;Cleaning historical data before migration.&lt;/p&gt;




&lt;h2&gt;
  
  
  When ELT Makes Sense
&lt;/h2&gt;

&lt;h3&gt;
  
  
  E-commerce Analytics Platform
&lt;/h3&gt;

&lt;p&gt;Tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fivetran&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SaaS Product Analytics
&lt;/h3&gt;

&lt;p&gt;Tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Segment&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Marketing Attribution Analysis
&lt;/h3&gt;

&lt;p&gt;Tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airbyte&lt;/li&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt;dbt&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Popular Tools for ETL and ELT
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Best Known For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Informatica PowerCenter&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;td&gt;Enterprise pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft SSIS&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;td&gt;SQL Server integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Talend Open Studio&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;td&gt;Open-source pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apache NiFi&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;td&gt;Real-time flows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Glue&lt;/td&gt;
&lt;td&gt;ETL/ELT&lt;/td&gt;
&lt;td&gt;AWS integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fivetran&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;td&gt;automated connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Airbyte&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;td&gt;open-source connectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dbt&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;td&gt;SQL transformations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Snowflake + dbt&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;td&gt;modern stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google BigQuery&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;td&gt;serverless analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  A Closer Look at dbt
&lt;/h2&gt;

&lt;p&gt;dbt enables analysts to write SQL SELECT statements that transform raw data directly inside the warehouse.&lt;/p&gt;

&lt;p&gt;Features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;version control&lt;/li&gt;
&lt;li&gt;testing&lt;/li&gt;
&lt;li&gt;documentation&lt;/li&gt;
&lt;li&gt;modular SQL models&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Which One Should You Choose?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Using cloud warehouse&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitive data&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frequent transformation changes&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legacy infrastructure&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SQL-based teams&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need raw data history&lt;/td&gt;
&lt;td&gt;ELT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;regulated industries&lt;/td&gt;
&lt;td&gt;ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;General Rule of Thumb:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you are building a new pipeline using a cloud warehouse, ELT is often the better starting point.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  7. Putting It All Together: A Practical Example
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Scenario: Online Bookstore
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Sources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Orders database (PostgreSQL)&lt;/li&gt;
&lt;li&gt;Customer reviews (MongoDB)&lt;/li&gt;
&lt;li&gt;Marketing emails (Mailchimp API)&lt;/li&gt;
&lt;li&gt;Website behaviour (Google Analytics)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;Build a dashboard showing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;daily revenue&lt;/li&gt;
&lt;li&gt;top-selling books&lt;/li&gt;
&lt;li&gt;customer acquisition cost&lt;/li&gt;
&lt;li&gt;review sentiment trends&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ETL Approach
&lt;/h2&gt;

&lt;p&gt;Talend extracts from multiple sources, transforms on ETL server, loads into SQL Server warehouse.&lt;/p&gt;




&lt;h2&gt;
  
  
  ELT Approach
&lt;/h2&gt;

&lt;p&gt;Fivetran loads raw data into Snowflake.&lt;br&gt;&lt;br&gt;
dbt transforms raw tables into analytics models.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Which approach wins?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
ELT provides more flexibility for analytics teams.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;ETL and ELT are architectural patterns with different strengths.&lt;/p&gt;

&lt;p&gt;ETL excels in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;regulated environments&lt;/li&gt;
&lt;li&gt;structured pipelines&lt;/li&gt;
&lt;li&gt;legacy systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ELT excels in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cloud analytics&lt;/li&gt;
&lt;li&gt;scalability&lt;/li&gt;
&lt;li&gt;flexibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key difference:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ETL cleans before storing.&lt;br&gt;&lt;br&gt;
ELT stores before cleaning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As modern data tooling evolves, ELT is becoming the default approach for analytics engineering workflows.&lt;/p&gt;

&lt;p&gt;Understanding both approaches allows you to design better pipelines and make smarter technical decisions.&lt;/p&gt;




</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Advanced SQL Techniques Every Data Analyst Should Know</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Thu, 09 Apr 2026 10:54:27 +0000</pubDate>
      <link>https://dev.to/john_analytics/advanced-sql-techniques-every-data-analyst-should-know-1b86</link>
      <guid>https://dev.to/john_analytics/advanced-sql-techniques-every-data-analyst-should-know-1b86</guid>
      <description>&lt;p&gt;&lt;em&gt;You can write a SELECT statement. You can JOIN tables and slap on a WHERE clause. But somewhere between "I know SQL" and "I really know SQL" lies a gap that separates analysts who get things done from analysts who get things done fast, elegantly, and correctly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This article covers the techniques that live in that gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Window Functions
&lt;/h2&gt;

&lt;p&gt;Most analysts discover &lt;code&gt;GROUP BY&lt;/code&gt; early and lean on it forever. Window functions do something fundamentally different — they let you compute aggregates &lt;em&gt;without collapsing rows&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dept_avg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;diff_from_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You get one row per employee, but each row carries its department's average alongside it. No subquery. No self-join. No mess.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running totals and moving averages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
    &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rolling_7day_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ROWS BETWEEN&lt;/code&gt; clause is where window functions get powerful. You can define exactly which rows belong to each window frame — preceding rows, following rows, or any combination.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ranking without ties headaches
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank_with_gaps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank_no_gaps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RANK()&lt;/code&gt; skips numbers after ties. &lt;code&gt;DENSE_RANK()&lt;/code&gt; doesn't. &lt;code&gt;ROW_NUMBER()&lt;/code&gt; ignores ties entirely and just counts. Know which one you actually need before reaching for the first one.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. CTEs
&lt;/h2&gt;

&lt;p&gt;Common Table Expressions (CTEs) don't make your query faster (usually), but they make it more readable — and readable queries are maintainable queries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
  &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;revenue_growth&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_month_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
           &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mom_growth_pct&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;monthly_revenue&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;revenue_growth&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;mom_growth_pct&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each CTE is a named, composable step. You can read this top-to-bottom and understand exactly what's happening. Compare that to a nested subquery version and you'll never go back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recursive CTEs for hierarchical data
&lt;/h3&gt;

&lt;p&gt;When you have org charts, category trees, or any parent-child relationship, recursive CTEs are the right tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;org_chart&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="c1"&gt;-- Base case: top-level managers&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

  &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

  &lt;span class="c1"&gt;-- Recursive case: direct reports&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;oc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;oc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;' &amp;gt; '&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
  &lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;org_chart&lt;/span&gt; &lt;span class="n"&gt;oc&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;org_chart&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This walks the entire hierarchy in a single query, no matter how deep it goes.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Advanced Aggregations: GROUPING SETS, ROLLUP, and CUBE
&lt;/h2&gt;

&lt;p&gt;Say goodbye to UNION ALL chains for multi-level summaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;GROUPING&lt;/span&gt; &lt;span class="k"&gt;SETS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- subtotals by region + category&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                    &lt;span class="c1"&gt;-- subtotals by region only&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="c1"&gt;-- subtotals by category only&lt;/span&gt;
  &lt;span class="p"&gt;()&lt;/span&gt;                           &lt;span class="c1"&gt;-- grand total&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ROLLUP&lt;/code&gt; is a shorthand when your groupings have a natural hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;ROLLUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;-- Produces: (year, quarter, month), (year, quarter), (year), ()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;CUBE&lt;/code&gt; generates all possible combinations. Useful for cross-dimensional analysis, but be careful — it grows exponentially with the number of dimensions.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The FILTER Clause: Conditional Aggregation Without CASE
&lt;/h2&gt;

&lt;p&gt;Most people do conditional aggregation like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;completed_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'refunded'&lt;/span&gt;  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;refunded_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a cleaner way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'completed'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;completed_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'refunded'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;refunded_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;FILTER&lt;/code&gt; clause attaches directly to the aggregate function. It's not just aesthetically cleaner — it makes the intent unmistakably clear, and it works with any aggregate function including &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;STRING_AGG&lt;/code&gt;, and window functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. LATERAL Joins: Correlated Subqueries That Scale
&lt;/h2&gt;

&lt;p&gt;A &lt;code&gt;LATERAL&lt;/code&gt; join lets a subquery in the &lt;code&gt;FROM&lt;/code&gt; clause reference columns from tables that appear earlier in the same &lt;code&gt;FROM&lt;/code&gt; clause. Think of it as a &lt;code&gt;for each row, compute this&lt;/code&gt; operation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
  &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fetches the 3 most recent orders per customer — something that's awkward with a window function and impossible with a regular join. Lateral joins shine for top-N-per-group patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. String Aggregation and Array Operations
&lt;/h2&gt;

&lt;p&gt;Real-world data is messy. Sometimes you need to collapse multiple rows into a single delimited string, or work with arrays directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Collapse tags into a comma-separated list per article&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;STRING_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;article_tags&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- PostgreSQL: aggregate into an actual array&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ARRAY_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;purchased_products&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;purchases&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And once you have arrays, you can query into them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_preferences&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="s1"&gt;'dark_mode'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;feature_flags&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Query Optimization: Reading EXPLAIN Output
&lt;/h2&gt;

&lt;p&gt;Fast queries aren't magic — they're the result of understanding what the database is actually doing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output tells you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seq Scan vs Index Scan&lt;/strong&gt;: sequential scans on large tables are a red flag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash Join vs Nested Loop&lt;/strong&gt;: hash joins are usually better for large datasets; nested loops for small ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual rows vs estimated rows&lt;/strong&gt;: large discrepancies mean stale statistics — run &lt;code&gt;ANALYZE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows removed by filter&lt;/strong&gt;: indexes on the right columns eliminate these entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few high-impact habits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index foreign keys (the database often won't do this automatically)&lt;/li&gt;
&lt;li&gt;Avoid functions on indexed columns in WHERE clauses — &lt;code&gt;WHERE YEAR(created_at) = 2024&lt;/code&gt; can't use an index on &lt;code&gt;created_at&lt;/code&gt;, but &lt;code&gt;WHERE created_at &amp;gt;= '2024-01-01'&lt;/code&gt; can&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;LIMIT&lt;/code&gt; with &lt;code&gt;OFFSET&lt;/code&gt; carefully — large offsets scan and discard rows; keyset pagination is faster for deep pages&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Date and Time Manipulation
&lt;/h2&gt;

&lt;p&gt;Time-series analysis is central to most analyst work, and doing it well requires a solid command of date functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Cohort analysis: group users by signup month&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cohort_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activity_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;activity_month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;DATE_PART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AGE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;activity_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;months_since_signup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_users&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_activity&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Generate a complete date spine (no gaps even if data is missing)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'2024-12-31'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;interval&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The date spine pattern is essential for time-series work — join it against your data and missing dates appear as NULLs rather than disappearing from your results entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. NULL Handling
&lt;/h2&gt;

&lt;p&gt;NULL is not zero. NULL is not an empty string. NULL is the absence of a value, and it propagates in ways that catch everyone out at some point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- This looks right but silently excludes NULLs from the average&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- NULL values are ignored by AVG — this may be what you want, or may not be&lt;/span&gt;

&lt;span class="c1"&gt;-- Be explicit:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_excluding_nulls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_treating_null_as_zero&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;non_null_rows&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the classic NULL comparison mistake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- This never returns rows where manager_id is NULL&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;span class="c1"&gt;-- You need:&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;NULLIF&lt;/code&gt; is your friend for division-by-zero protection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sessions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue_per_session&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;traffic_data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;sessions&lt;/code&gt; is 0, &lt;code&gt;NULLIF&lt;/code&gt; returns NULL, and dividing by NULL yields NULL — no error, no corrupt data.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Pivot Tables with CASE and FILTER
&lt;/h2&gt;

&lt;p&gt;SQL doesn't have a native PIVOT keyword in most databases, but you can build pivot tables with conditional aggregation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue_2022&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue_2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue_2024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
      &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2022&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pct_change_2022_to_2024&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue_2024&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="n"&gt;NULLS&lt;/span&gt; &lt;span class="k"&gt;LAST&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;These techniques aren't separate tricks to memorize — they combine. A typical advanced query might use a CTE to build a clean base dataset, window functions to compute ranks and running totals, &lt;code&gt;FILTER&lt;/code&gt; for conditional aggregation, and &lt;code&gt;LATERAL&lt;/code&gt; to pull related records. The result is a single, readable, performant query that would have taken three separate queries and some Python glue code to produce otherwise.&lt;/p&gt;

</description>
      <category>data</category>
      <category>sql</category>
      <category>postgressql</category>
      <category>analytics</category>
    </item>
    <item>
      <title>CONNECTING POSTGRESQL WITH POWERBI (FOR A LOAN PERFORMANCE DASHBOARD)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Thu, 09 Apr 2026 10:23:23 +0000</pubDate>
      <link>https://dev.to/john_analytics/-connecting-pstgresql-with-powerbi-for-a-loan-performance-dashboard-5cjg</link>
      <guid>https://dev.to/john_analytics/-connecting-pstgresql-with-powerbi-for-a-loan-performance-dashboard-5cjg</guid>
      <description>&lt;h2&gt;
  
  
  INTRODUCTION
&lt;/h2&gt;

&lt;p&gt;Power BI is one of the most widely used business intelligence tools out there. One of the most valuable skills for a data analyst is the ability to transform raw data into actionable insights. &lt;/p&gt;

&lt;p&gt;In practical financial institutions like banks, SACCOs, and fintech firms, loan-related data is produced on a daily basis. This encompasses details about borrowers, disbursed amounts, repayment histories, and indicators of loan performance. Left unanalyzed, all of this information sits dormant and serves no purpose.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will develop a Loan Performance Dashboard by integrating PostgreSQL with Power BI. This project illustrates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to store structured data within PostgreSQL.&lt;/li&gt;
&lt;li&gt;Establish a connection between PostgreSQL and Power BI&lt;/li&gt;
&lt;li&gt;Prepare data for reporting purposes, construct essential loan performance metrics, and design an interactive dashboard.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Use PostgreSQL Together with Power BI?
&lt;/h2&gt;

&lt;p&gt;PostgreSQL ranks among the most widely used relational databases in the field of data analytics. It enables analysts to store structured datasets in an organized manner and carry out transformations through SQL. Power BI, on the other hand, is a business intelligence tool that empowers analysts to build dashboards and interactive reports.&lt;/p&gt;

&lt;p&gt;When combined, the two form a robust workflow: PostgreSQL handles data storage, SQL manages data preparation, and Power BI takes care of visualization and insight generation. This workflow finds broad application across financial analytics, credit risk analysis, fintech analytics, and business intelligence reporting.&lt;/p&gt;

&lt;h3&gt;
  
  
  Project Overview
&lt;/h3&gt;

&lt;p&gt;I want to build a dashboard that answers questions such as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many loans have been issued?&lt;/li&gt;
&lt;li&gt;How much money has been disbursed?&lt;/li&gt;
&lt;li&gt;How much has been repaid?&lt;/li&gt;
&lt;li&gt;How much is still outstanding?&lt;/li&gt;
&lt;li&gt;Which counties have the highest borrowing levels?&lt;/li&gt;
&lt;li&gt;What percentage of loans are defaulted?&lt;/li&gt;
&lt;li&gt;How does loan activity change over time?&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;To accomplish this, I will set up a simple relational database consisting of three tables — borrowers, loans, and repayments.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I will work with a small sample dataset to keep the project straightforward and easy to follow. This dataset represents a simplified loan portfolio containing borrower information alongside repayment activity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Create Tables in PostgreSQL
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Create the borrowers table&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;borrowers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;borrower_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;borrower_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;gender&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;county&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;employment_status&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Create the loans table&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;loans&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;loan_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;borrower_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loan_amount&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;interest_rate&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;loan_term_months&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loan_status&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;borrower_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;borrowers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;borrower_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Create the repayments table&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;repayments&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;repayment_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loan_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payment_amount&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;payment_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loan_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;loans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loan_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Load Data Into PostgreSQL
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Import the CSV files into PostgreSQL using DBeaver or pgAdmin. Alternatively, insert the data manually using SQL INSERT statements.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 3: Creating a Reporting View
&lt;/h3&gt;

&lt;p&gt;Rather than connecting Power BI directly to raw tables, we build a view that brings together borrower, loan, and repayment data in one place. This makes the reporting process in Power BI significantly more straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;vw_loan_dashboard&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borrower_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;county&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employment_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interest_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_term_months&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_paid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;payment_amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;outstanding_balance&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;loans&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;borrowers&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borrower_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borrower_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;repayments&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;borrower_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;gender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;county&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employment_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;interest_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_term_months&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loan_status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;This view provides a clean dataset ready for Power BI.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The &lt;code&gt;vw_loan_dashboard&lt;/code&gt; view serves as the reporting layer for this project. It joins the &lt;code&gt;loans&lt;/code&gt;, &lt;code&gt;borrowers&lt;/code&gt;, and &lt;code&gt;repayments&lt;/code&gt; tables into a single clean structure, pulling in borrower details alongside each loan record. A LEFT JOIN is used with the repayments table to ensure that loans with no repayment activity are still captured. From there, two calculated fields are derived — &lt;code&gt;total_paid&lt;/code&gt;, which sums all repayments per loan, and &lt;code&gt;outstanding_balance&lt;/code&gt;, which subtracts the total paid from the original loan amount. This consolidated view makes it straightforward to build meaningful metrics and visualizations in Power BI without managing complex joins on the reporting side.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Step 4: Connect PostgreSQL to Power BI
&lt;/h3&gt;

&lt;p&gt;Launch Power BI Desktop and navigate to Home, then Get Data, and select PostgreSQL Database. Enter your server name along with your database name, then load the following: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;borrowers&lt;/li&gt;
&lt;li&gt;loans&lt;/li&gt;
&lt;li&gt;repayments&lt;/li&gt;
&lt;li&gt;vw_loan_dashboard&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 5: Create Power BI measures
&lt;/h3&gt;

&lt;p&gt;In this step we create the following DAX measures&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Total Loans&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total_Loans = COUNT('loans vw_loan_dashboard'[loan_id])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Total Disbursed Amount&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total_Disbursed_Amount = SUM('loans vw_loan_dashboard'[loan_amount])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Total Amount Paid&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total_Amount_Paid = SUM('loans vw_loan_dashboard'[total_paid])&lt;/code&gt;&lt;br&gt;
&lt;em&gt;Total Outstanding Balance&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Total_Outstanding_Balance = SUM('loans vw_loan_dashboard'[outstanding_balance])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Average Loan Size&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Average_Loan_Size = AVERAGE('loans vw_loan_dashboard'[loan_amount])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Defaulted Loans&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Defaulted_Loans = CALCULATE(COUNT('loans vw_loan_dashboard'[loan_id]),'loans vw_loan_dashboard'[loan_status]="Defaulted")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Default Rate&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Default_Rate = DIVIDE([Defaulted_Loans], [Total_Loans], 0) * 100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Expected Interest Revenue&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Expected_Interest_Revenue = SUMX('loans vw_loan_dashboard','loans vw_loan_dashboard'[loan_amount] * 'loans vw_loan_dashboard'[interest_rate])&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Build The Dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Dashboard Layout Example&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47yenrmbx1ng0w2c2ngb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47yenrmbx1ng0w2c2ngb.png" alt="Sample Layout"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Connecting PostgreSQL with Power BI allows analysts to transform structured financial data into meaningful insights.&lt;/p&gt;

&lt;p&gt;Loan performance dashboards help organizations understand portfolio health, monitor repayment behavior, and identify potential risks.&lt;/p&gt;

</description>
      <category>data</category>
      <category>postgres</category>
      <category>postgressql</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Building an AI-Powered Personalized Learning Platform with FastAPI, PostgreSQL, and Mistral AI</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 10 Mar 2026 08:55:50 +0000</pubDate>
      <link>https://dev.to/john_analytics/building-an-ai-powered-personalized-learning-platform-with-fastapi-postgresql-and-mistral-ai-3834</link>
      <guid>https://dev.to/john_analytics/building-an-ai-powered-personalized-learning-platform-with-fastapi-postgresql-and-mistral-ai-3834</guid>
      <description>&lt;p&gt;Artificial Intelligence is transforming education by enabling systems&lt;br&gt;
that adapt to individual learning needs. In this article, I'll walk&lt;br&gt;
through how I built an &lt;strong&gt;AI-powered personalized learning platform&lt;/strong&gt;&lt;br&gt;
that generates quizzes, tracks student progress, and provides real-time&lt;br&gt;
insights for teachers.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Traditional learning platforms often deliver the same content to every&lt;br&gt;
student, regardless of their performance. However, students learn at&lt;br&gt;
different speeds and struggle with different topics.&lt;/p&gt;

&lt;p&gt;The goal of this project was to build a system that:&lt;/p&gt;

&lt;p&gt;• Generates quizzes automatically using AI&lt;br&gt;
• Tracks student learning behavior&lt;br&gt;
• Detects struggling students&lt;br&gt;
• Provides teachers with data-driven insights&lt;/p&gt;


&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The system consists of four main components:&lt;/p&gt;

&lt;p&gt;Student Interaction Layer&lt;/p&gt;

&lt;p&gt;FastAPI Backend&lt;/p&gt;

&lt;p&gt;PostgreSQL Database&lt;/p&gt;

&lt;p&gt;AI Engine (Mistral)&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture overview:
&lt;/h2&gt;

&lt;p&gt;Students&lt;/p&gt;

&lt;p&gt;↓&lt;/p&gt;

&lt;p&gt;FastAPI API&lt;/p&gt;

&lt;p&gt;↓&lt;/p&gt;

&lt;p&gt;PostgreSQL Database&lt;/p&gt;

&lt;p&gt;↓&lt;/p&gt;

&lt;p&gt;Mistral AI&lt;/p&gt;

&lt;p&gt;↓&lt;/p&gt;

&lt;p&gt;Analytics Dashboard&lt;/p&gt;


&lt;h2&gt;
  
  
  AI Quiz Generation
&lt;/h2&gt;

&lt;p&gt;Instead of manually creating quizzes, the platform uses &lt;strong&gt;Mistral AI&lt;/strong&gt;&lt;br&gt;
to generate questions dynamically.&lt;/p&gt;

&lt;p&gt;Example API endpoint:&lt;/p&gt;

&lt;p&gt;GET /generate-quiz/algebra&lt;/p&gt;

&lt;p&gt;The AI returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Question&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple choice answers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Correct answer&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explanation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the platform to generate unlimited quizzes for any topic.&lt;/p&gt;


&lt;h2&gt;
  
  
  Real-Time Feedback
&lt;/h2&gt;

&lt;p&gt;When students submit answers, the backend evaluates correctness and generates explanations.&lt;/p&gt;

&lt;p&gt;POST /submit-answer&lt;/p&gt;

&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;correct&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="s"&gt;\&lt;/span&gt;
&lt;span class="na"&gt;score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100\&lt;/span&gt;
&lt;span class="na"&gt;feedback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Explanation of the answer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Students receive immediate feedback, improving engagement and learning efficiency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Adaptive Learning
&lt;/h2&gt;

&lt;p&gt;One of the most important features is &lt;strong&gt;adaptive difficulty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If a student performs well, the system generates harder questions.&lt;/p&gt;

&lt;p&gt;If a student struggles, the system provides simpler explanations and&lt;br&gt;
easier quizzes.&lt;/p&gt;

&lt;p&gt;This creates a personalized learning experience.&lt;/p&gt;


&lt;h2&gt;
  
  
  Data Analytics with SQL
&lt;/h2&gt;

&lt;p&gt;Every interaction is stored in PostgreSQL, allowing powerful analytics.&lt;/p&gt;

&lt;p&gt;Example insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Average student performance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Topic difficulty analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Learning trends over time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Detection of struggling students&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example SQL query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;quiz_results&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;student_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Teacher Dashboard
&lt;/h2&gt;

&lt;p&gt;To visualize insights, I built a &lt;strong&gt;Streamlit dashboard&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teachers can view:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Student performance &lt;/li&gt;
&lt;li&gt;Difficult topics&lt;/li&gt;
&lt;li&gt;Performance trends&lt;/li&gt;
&lt;li&gt;At-risk students&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows educators to identify learning gaps early.&lt;/p&gt;




&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;FastAPI&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Mistral AI&lt;/li&gt;
&lt;li&gt;SQL Analytics&lt;/li&gt;
&lt;li&gt;Streamlit&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;AI-powered learning platforms have the potential to transform education by making learning &lt;strong&gt;personalized, adaptive, and data-driven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This project is a simplified prototype of what modern EdTech platforms can achieve using open-source tools and AI models.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>fastapi</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title># Understanding Joins and Window Functions in SQL</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:31:49 +0000</pubDate>
      <link>https://dev.to/john_analytics/-understanding-joins-and-window-functions-in-sql-293f</link>
      <guid>https://dev.to/john_analytics/-understanding-joins-and-window-functions-in-sql-293f</guid>
      <description>&lt;p&gt;When working with relational databases, data is rarely stored in one single table. Instead, it is organized into multiple related tables to reduce redundancy and improve structure.&lt;/p&gt;

&lt;p&gt;In a typical transactional database, you might have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;customers&lt;/code&gt; → customer information
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orders&lt;/code&gt; → order transactions
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;books&lt;/code&gt; → product details
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze meaningful business insights, we must first combine this data. That’s where &lt;strong&gt;SQL Joins&lt;/strong&gt; come in.&lt;/p&gt;

&lt;p&gt;Once the data is combined, we can apply &lt;strong&gt;Window Functions&lt;/strong&gt; to perform advanced analysis such as ranking, running totals, and trend comparisons.&lt;/p&gt;

&lt;p&gt;This article walks you through both concepts in a logical flow — starting with joins and finishing with window functions.&lt;/p&gt;




&lt;h2&gt;
  
  
  SQL JOINS
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Why Joins Matter
&lt;/h2&gt;

&lt;p&gt;Relational databases are built on relationships.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A customer places many orders.&lt;/li&gt;
&lt;li&gt;An order references one book.&lt;/li&gt;
&lt;li&gt;A book can appear in many orders.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To analyze this properly, we must join the tables together using a shared key — usually &lt;code&gt;customer_id&lt;/code&gt; or &lt;code&gt;book_id&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  INNER JOIN
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;INNER JOIN&lt;/strong&gt; returns only rows that exist in both tables.&lt;/p&gt;

&lt;p&gt;If we want to see which customers placed orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Returns only customers who have placed orders&lt;br&gt;&lt;br&gt;
❌ Excludes customers with no orders  &lt;/p&gt;

&lt;p&gt;This is the most commonly used join in data analysis.&lt;/p&gt;


&lt;h2&gt;
  
  
  LEFT JOIN (LEFT OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;LEFT JOIN&lt;/strong&gt; returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All rows from the left table
&lt;/li&gt;
&lt;li&gt;Matching rows from the right table
&lt;/li&gt;
&lt;li&gt;NULL where no match exists
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Show all customers, even those who haven’t ordered anything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Every customer appears&lt;br&gt;&lt;br&gt;
✔ Customers without orders will show NULL in order columns  &lt;/p&gt;

&lt;p&gt;This is extremely useful for identifying inactive customers.&lt;/p&gt;


&lt;h2&gt;
  
  
  RIGHT JOIN
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;RIGHT JOIN&lt;/strong&gt; does the opposite of a LEFT JOIN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures all orders appear, even if customer data is missing.&lt;/p&gt;

&lt;p&gt;In practice, RIGHT JOIN is less common because we can usually rewrite it using LEFT JOIN by switching table order.&lt;/p&gt;




&lt;h2&gt;
  
  
  FULL JOIN (FULL OUTER JOIN)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;FULL JOIN&lt;/strong&gt; returns all rows from both tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;FULL&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔ Shows matches&lt;br&gt;&lt;br&gt;
✔ Shows unmatched customers&lt;br&gt;&lt;br&gt;
✔ Shows unmatched orders  &lt;/p&gt;

&lt;p&gt;This is useful for data auditing and reconciliation.&lt;/p&gt;


&lt;h2&gt;
  
  
  Joining Multiple Tables
&lt;/h2&gt;

&lt;p&gt;Real-world analysis often requires more than two tables.&lt;/p&gt;

&lt;p&gt;Example: See which customer ordered which book.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer name
&lt;/li&gt;
&lt;li&gt;Book title
&lt;/li&gt;
&lt;li&gt;Quantity ordered
&lt;/li&gt;
&lt;li&gt;Order date
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This joined dataset becomes the foundation for deeper analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  CROSS JOIN
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;CROSS JOIN&lt;/strong&gt; produces all possible combinations between two tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 customers
&lt;/li&gt;
&lt;li&gt;5 books
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get 50 rows.&lt;/p&gt;

&lt;p&gt;This is useful when generating combinations for simulations or recommendation systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anti-Join (Finding Missing Records)
&lt;/h2&gt;

&lt;p&gt;SQL doesn’t have a direct &lt;code&gt;ANTI JOIN&lt;/code&gt;, but we simulate it using LEFT JOIN + NULL filtering.&lt;/p&gt;

&lt;p&gt;Example: Find customers who have never placed an order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is powerful for churn analysis and business reporting.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Joins Prepare Data for Window Functions
&lt;/h2&gt;

&lt;p&gt;Notice something important:&lt;/p&gt;

&lt;p&gt;Most advanced SQL analysis begins like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Joins combine related data&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Window functions analyze that combined dataset&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Joins prepare the structure.&lt;br&gt;&lt;br&gt;
Window functions perform the analytics.&lt;/p&gt;

&lt;p&gt;Now that we understand joins, let’s move into advanced analytics.&lt;/p&gt;


&lt;h2&gt;
  
  
  WINDOW FUNCTIONS
&lt;/h2&gt;

&lt;p&gt;They allow you to perform calculations across a set of table rows that are somehow related to the current row.&lt;/p&gt;

&lt;p&gt;Window functions can be used to  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rank Rows.
&lt;/li&gt;
&lt;li&gt;Calculate cumulative totals.
&lt;/li&gt;
&lt;li&gt;Find the difference between consecutive rows in a dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Window functions return a value for each row while still providing information from the related rows.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  ROW_NUMBER ()
&lt;/h2&gt;

&lt;p&gt;Assign a unique row number to each row in the result set.&lt;/p&gt;

&lt;p&gt;In a real world scenario it can help us track which order each customer made first, second.......&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Assigns a unique number to each row, starting from 1 based on the order specified by the ORDER BY clause.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The number will reset for each position if PARTITION BY is used.&lt;/p&gt;

&lt;p&gt;Assign a unique row number to each order based on the order date and we want to reset numbering for each customer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ROW_NUMBER () : Assigns a unique number to each order.&lt;br&gt;&lt;br&gt;
PARTITION BY o.customer_id : Ensures that the row numbering starts fresh for each customer.&lt;br&gt;&lt;br&gt;
Query will list orders for each customer showing their row number (1,2,3---) in the sequence of orders.&lt;/p&gt;

&lt;p&gt;Ranking the orders globally based on order date without resetting the numbering for each customer&lt;br&gt;&lt;br&gt;
What is needed just remove the PARTITION BY Clause&lt;/p&gt;

&lt;p&gt;SQL Query Without Resetting Row Number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Without PARTITION BY the numbering is continous for all orders across customers based on their order_date.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  RANK() AND DENSE_RANK()
&lt;/h2&gt;

&lt;p&gt;RANK() assigns a rank to each row, with ties getting the same rank but leaving gaps in subsequent ranks.&lt;br&gt;&lt;br&gt;
DENSE_RANK() works similarly but without leaving gaps in the ranking.&lt;/p&gt;
&lt;h3&gt;
  
  
  RANK() SQL QUERY
&lt;/h3&gt;

&lt;p&gt;Rank the customers based on the total quantity of books as they are ordered&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RANK() assigns a rank based on the SUM(o.quantity) in descending order.&lt;br&gt;&lt;br&gt;
If two customers have the same total quantity ordered they will receive the same rank and the next rank will have a gap. *Two customers rank 1 will result in the next customer being ranked 3rd.&lt;/p&gt;
&lt;h3&gt;
  
  
  USING DENSE_RANK()
&lt;/h3&gt;

&lt;p&gt;Assigns a rank without gaps for ties.&lt;/p&gt;

&lt;p&gt;Assigns a rank to each row but it does not leave gaps in the rankings if there are ties.&lt;/p&gt;

&lt;p&gt;Calculate the dense rank of customers based on the total quantity of books they ordered&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dense_rank&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dense_rank&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If two customers are tied they will receive the same rank but the next customer will receive the next consecutive rank.&lt;br&gt;&lt;br&gt;
1, 1, next will be 2nd&lt;br&gt;&lt;br&gt;
This key difference in RANK() and DENSE_RANK() is crucial for how you want to treat tied values in your analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  LEAD() AND LAG()
&lt;/h2&gt;

&lt;h3&gt;
  
  
  LEAD()
&lt;/h3&gt;

&lt;p&gt;LEAD() access next row's value.&lt;br&gt;&lt;br&gt;
It is used to access a row that follows the current row at a specific physical offset.&lt;br&gt;&lt;br&gt;
Generally employed to compare the value of the current row with the value of the next row following the current row.&lt;/p&gt;

&lt;p&gt;Compare quantity ordered by each customer in the current row with the quantity ordered in the next row&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;LEAD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;next_quantity&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;LEAD(o.quantity) allows you to access the quantity of the next row for each customer.&lt;br&gt;&lt;br&gt;
Query gives the quantity ordered by the customer in the current row and the quantity ordered by the same customer in the next row.&lt;br&gt;&lt;br&gt;
For the last row for each customer the next quantity will be NULL because there is no next row.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  LAG()
&lt;/h3&gt;

&lt;p&gt;Access previous rows value&lt;br&gt;&lt;br&gt;
It is crucial for analyzing trends or behavior change over time.&lt;br&gt;&lt;br&gt;
Allows you to access data from a previous row within the same result set and is crucial for comparing values in the current row with the preceding row.&lt;br&gt;&lt;br&gt;
Operates on partitions created by the PARTITION BY clause.&lt;/p&gt;

&lt;p&gt;Compare quantity ordered by each customer in the current row with the quantity ordered in the previous row using the LAG() function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_quantity&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;LAG(o.quantity) allows you to access the quantity ordered in the previous row for each customer&lt;br&gt;&lt;br&gt;
The query shows the quantity ordered in the current row and the quantity ordered in the previous row for the same customer.&lt;br&gt;&lt;br&gt;
First row previous quantity will be NULL as there is no previous row.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  NTILE() FUNCTION
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Partitions data into specified number of buckets.&lt;br&gt;&lt;br&gt;
Crucial for data analysis and reporting as it allows users to efficiently distribute rows and analyze data in a structured manner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We want to divide customers into 2 groups (quartiles) based on their total order quantity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="n"&gt;NTILE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;quantity_tile&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;quantity_tile&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NTILE(2) divides customers into two equal groups (quartiles) based on their total quantity ordered.&lt;/p&gt;

&lt;h2&gt;
  
  
  PARTITION BY
&lt;/h2&gt;

&lt;p&gt;Divides result set into partitions to apply window functions independently.&lt;/p&gt;

&lt;p&gt;This clause divides the result set into partitions and the window function works independently within each partition.&lt;/p&gt;

&lt;p&gt;calculate the total quantity of orders for each customer and the average price of the books ordered by each customer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_order_quantity&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;books&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;book_id&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;second_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;SUM(o.quantity) gives total quantity ordered by each customer.&lt;br&gt;&lt;br&gt;
PARTITION BY o.customer_id ensures the total order quantity is calculated for each individual customer.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>sql</category>
      <category>postgres</category>
      <category>data</category>
      <category>database</category>
    </item>
    <item>
      <title>## I Built an AI Tourism Assistant for Kenya Using RAG, pgvector, and Streamlit</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Wed, 04 Mar 2026 09:21:41 +0000</pubDate>
      <link>https://dev.to/john_analytics/-i-built-an-ai-tourism-assistant-for-kenya-using-rag-pgvector-and-streamlit-c71</link>
      <guid>https://dev.to/john_analytics/-i-built-an-ai-tourism-assistant-for-kenya-using-rag-pgvector-and-streamlit-c71</guid>
      <description>&lt;p&gt;Imagine asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What's the best luxury safari in Maasai Mara?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and instantly getting &lt;strong&gt;personalized travel recommendations powered by&lt;br&gt;
AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's exactly what I built --- an &lt;strong&gt;AI Tourism Intelligence Assistant&lt;/strong&gt;&lt;br&gt;
that helps travelers discover the best travel packages in Kenya based on&lt;br&gt;
their &lt;strong&gt;budget, travel style, duration, and preferred destination.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through:&lt;/p&gt;

&lt;p&gt;• The idea behind the project&lt;br&gt;
• How I built the AI recommendation system&lt;br&gt;
• The RAG architecture powering it&lt;br&gt;
• How vector search makes travel discovery smarter&lt;br&gt;
• Deployment with Streamlit&lt;/p&gt;




&lt;h2&gt;
  
  
  ✨ The Idea
&lt;/h2&gt;

&lt;p&gt;Kenya is one of the world's most beautiful tourism destinations,&lt;br&gt;
offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Wildlife safaris 🦁&lt;/li&gt;
&lt;li&gt;  Tropical beaches 🏝&lt;/li&gt;
&lt;li&gt;  Mountain adventures ⛰&lt;/li&gt;
&lt;li&gt;  Cultural experiences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But planning trips can be frustrating because:&lt;/p&gt;

&lt;p&gt;• Travel packages are scattered across multiple websites&lt;br&gt;
• Platforms rarely provide &lt;strong&gt;personalized recommendations&lt;/strong&gt;&lt;br&gt;
• Comparing destinations based on budget or style is difficult&lt;/p&gt;

&lt;p&gt;So I decided to build an &lt;strong&gt;AI-powered tourism assistant&lt;/strong&gt; that could:&lt;/p&gt;

&lt;p&gt;✔ Understand traveler preferences&lt;br&gt;
✔ Retrieve relevant travel packages&lt;br&gt;
✔ Generate intelligent recommendations&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What the AI Assistant Does
&lt;/h2&gt;

&lt;p&gt;Users simply input their preferences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Budget&lt;/li&gt;
&lt;li&gt;  Travel duration&lt;/li&gt;
&lt;li&gt;  Travel style&lt;/li&gt;
&lt;li&gt;  Preferred destination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system then returns &lt;strong&gt;relevant travel packages from a tourism&lt;br&gt;
database.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example query:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Budget: $2000&lt;br&gt;
Days: 5&lt;br&gt;
Style: Relaxing&lt;br&gt;
Destination: Diani&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The assistant responds with &lt;strong&gt;recommended travel packages matching those&lt;br&gt;
criteria.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ Tech Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Programming
&lt;/h3&gt;

&lt;p&gt;Python&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Engineering
&lt;/h3&gt;

&lt;p&gt;PostgreSQL&lt;br&gt;
pgvector&lt;/p&gt;

&lt;h3&gt;
  
  
  AI
&lt;/h3&gt;

&lt;p&gt;Mistral AI embeddings&lt;br&gt;
Retrieval-Augmented Generation (RAG)&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Collection
&lt;/h3&gt;

&lt;p&gt;Playwright&lt;br&gt;
BeautifulSoup&lt;/p&gt;

&lt;h3&gt;
  
  
  Backend
&lt;/h3&gt;

&lt;p&gt;SQLAlchemy&lt;/p&gt;

&lt;h3&gt;
  
  
  Frontend
&lt;/h3&gt;

&lt;p&gt;Streamlit&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment
&lt;/h3&gt;

&lt;p&gt;Streamlit Cloud&lt;br&gt;
Neon PostgreSQL&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗 System Architecture
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tourism Websites
      │
      ▼
Web Scraping (Playwright)
      │
      ▼
PostgreSQL Database
      │
      ▼
Embedding Generation (Mistral AI)
      │
      ▼
Vector Database (pgvector)
      │
      ▼
Recommendation Engine
      │
      ▼
Streamlit Web Application
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  🔎 How the RAG System Works
&lt;/h2&gt;

&lt;p&gt;The project uses &lt;strong&gt;Retrieval‑Augmented Generation (RAG)&lt;/strong&gt; to deliver&lt;br&gt;
intelligent responses.&lt;/p&gt;

&lt;p&gt;Instead of the AI guessing answers, it &lt;strong&gt;retrieves real travel packages&lt;br&gt;
from the database first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query
     │
     ▼
Convert Query → Embedding
     │
     ▼
Vector Similarity Search
     │
     ▼
Retrieve Relevant Travel Packages
     │
     ▼
Generate Personalized Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This ensures the AI responds with &lt;strong&gt;real tourism data rather than&lt;br&gt;
hallucinations.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🗄 Database Design
&lt;/h2&gt;

&lt;p&gt;The database stores travel information in structured tables such as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;travel_packages
destinations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each travel package contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Package name&lt;/li&gt;
&lt;li&gt;  Destination&lt;/li&gt;
&lt;li&gt;  Duration&lt;/li&gt;
&lt;li&gt;  Price&lt;/li&gt;
&lt;li&gt;  Description&lt;/li&gt;
&lt;li&gt;  Vector embedding&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔍 Why Vector Search Matters
&lt;/h2&gt;

&lt;p&gt;Traditional search relies on keywords.&lt;/p&gt;

&lt;p&gt;Vector search understands &lt;strong&gt;meaning and context.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, if a user searches:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Affordable safari in Kenya"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system can still return:&lt;/p&gt;

&lt;p&gt;• Budget Maasai Mara packages&lt;br&gt;
• Lake Nakuru safari deals&lt;br&gt;
• Amboseli wildlife tours&lt;/p&gt;

&lt;p&gt;Even if those exact words were not used.&lt;/p&gt;




&lt;h2&gt;
  
  
  💻 Building the Interface
&lt;/h2&gt;

&lt;p&gt;The frontend is built using &lt;strong&gt;Streamlit&lt;/strong&gt;, which makes it easy to create&lt;br&gt;
interactive data apps.&lt;/p&gt;

&lt;p&gt;Users can:&lt;/p&gt;

&lt;p&gt;✔ Enter travel preferences&lt;br&gt;
✔ Browse travel packages&lt;br&gt;
✔ Receive AI‑powered recommendations&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Deployment
&lt;/h2&gt;

&lt;p&gt;The application is deployed using:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streamlit Cloud&lt;/strong&gt; for hosting the web app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neon PostgreSQL&lt;/strong&gt; for the managed database.&lt;/p&gt;

&lt;p&gt;This allows the project to run fully online.&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Key Results
&lt;/h2&gt;

&lt;p&gt;The project successfully delivers:&lt;/p&gt;

&lt;p&gt;✔ AI-powered tourism recommendations&lt;br&gt;
✔ Semantic search using vector embeddings&lt;br&gt;
✔ A fully deployed web application&lt;br&gt;
✔ Personalized travel package discovery&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚠️ Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Web Scraping Complexity
&lt;/h3&gt;

&lt;p&gt;Many travel websites load content dynamically, which required&lt;br&gt;
&lt;strong&gt;Playwright&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Quality Issues
&lt;/h3&gt;

&lt;p&gt;Scraped data often contained:&lt;/p&gt;

&lt;p&gt;• Missing prices&lt;br&gt;
• Duplicate packages&lt;br&gt;
• Inconsistent destination names&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding Rate Limits
&lt;/h3&gt;

&lt;p&gt;Embedding generation triggered &lt;strong&gt;API rate limits&lt;/strong&gt;, requiring retry&lt;br&gt;
logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Configuration
&lt;/h3&gt;

&lt;p&gt;Deployment required careful setup of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Environment variables&lt;/li&gt;
&lt;li&gt;  Streamlit secrets&lt;/li&gt;
&lt;li&gt;  Database connection strings&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔮 Future Improvements
&lt;/h2&gt;

&lt;p&gt;Future versions of the system could include:&lt;/p&gt;

&lt;p&gt;• AI itinerary generation&lt;br&gt;
• Social media tourism trend analysis&lt;br&gt;
• Integration with booking APIs&lt;br&gt;
• User accounts and saved trips&lt;/p&gt;




&lt;h2&gt;
  
  
  🌍 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Combining &lt;strong&gt;vector databases, AI retrieval systems, and interactive web&lt;br&gt;
apps&lt;/strong&gt; opens powerful opportunities for building intelligent data&lt;br&gt;
products.&lt;/p&gt;

&lt;p&gt;This project demonstrates how AI can improve &lt;strong&gt;tourism discovery and&lt;br&gt;
travel planning&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>rag</category>
      <category>postgres</category>
      <category>etl</category>
      <category>data</category>
    </item>
    <item>
      <title>🏠 Building a Machine Learning Property Price Predictor (From Web Scraping to Deployment</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Mon, 23 Feb 2026 07:53:09 +0000</pubDate>
      <link>https://dev.to/john_analytics/building-a-machine-learning-property-price-predictor-from-web-scraping-to-deployment-2ma5</link>
      <guid>https://dev.to/john_analytics/building-a-machine-learning-property-price-predictor-from-web-scraping-to-deployment-2ma5</guid>
      <description>&lt;p&gt;In this project, I built a complete end-to-end machine learning system&lt;br&gt;
that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Scrapes property listings&lt;/li&gt;
&lt;li&gt;  Cleans and engineers features&lt;/li&gt;
&lt;li&gt;  Trains multiple ML models&lt;/li&gt;
&lt;li&gt;  Deploys a pricing app&lt;/li&gt;
&lt;li&gt;  Builds a business-ready dashboard&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article walks through the entire pipeline  from raw web data to a deployed ML product.&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 1 --- Web Scraping
&lt;/h2&gt;

&lt;p&gt;I built a Selenium scraper to extract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Location&lt;/li&gt;
&lt;li&gt;  Property Type&lt;/li&gt;
&lt;li&gt;  Bedrooms&lt;/li&gt;
&lt;li&gt;  Bathrooms&lt;/li&gt;
&lt;li&gt;  Size (sqm)&lt;/li&gt;
&lt;li&gt;  Amenities&lt;/li&gt;
&lt;li&gt;  Price (KES)&lt;/li&gt;
&lt;li&gt;  Listing Date&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Sample Scraping Logic
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;listings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XPATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;card&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;listings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XPATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.//a[contains(@href,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/listings/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;property_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sample Scraping Logic
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;listings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_elements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XPATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//div[contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;listing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;property&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or contains(@class,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;card&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;listings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;listing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;By&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;XPATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.//a[contains(@href,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/listings/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;property_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;href&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 --- Exploratory Analysis
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Most Expensive Locations
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;location_prices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price (KES)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ascending&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;location_prices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4 --- Modeling
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Train/Test Split
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;

&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bedrooms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bathrooms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Size (sqm)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amenity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Price (KES)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Linear Regression (Baseline)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;mae&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mean_absolute_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rmse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mean_squared_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt;

&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;xgboost&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;XGBRegressor&lt;/span&gt;

&lt;span class="n"&gt;xgb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;XGBRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learning_rate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;xgb_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;xgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5 --- Deployment (Streamlit App)
&lt;/h2&gt;

&lt;p&gt;The pricing app allows users to input:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Location&lt;/li&gt;
&lt;li&gt;  Bedrooms&lt;/li&gt;
&lt;li&gt;  Bathrooms&lt;/li&gt;
&lt;li&gt;  Size&lt;/li&gt;
&lt;li&gt;  Amenities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And returns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Predicted price&lt;/li&gt;
&lt;li&gt;  Estimated range (± MAE)&lt;/li&gt;
&lt;li&gt;  Explanation of price drivers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;streamlit run Streamlit_app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 6 --- Executive Dashboard
&lt;/h2&gt;

&lt;p&gt;Built using Streamlit with interactive filters.&lt;/p&gt;

&lt;p&gt;Includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Median price by location&lt;/li&gt;
&lt;li&gt;  Monthly price trends&lt;/li&gt;
&lt;li&gt;  Price per sqft comparison&lt;/li&gt;
&lt;li&gt;  Amenity impact analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;streamlit run Dashboard.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key Insights
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Size is the strongest determinant of price.&lt;/li&gt;
&lt;li&gt;  Premium neighborhoods significantly increase valuation.&lt;/li&gt;
&lt;li&gt;  Amenities increase value but are secondary drivers.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>webscraping</category>
      <category>streamlit</category>
      <category>python</category>
      <category>data</category>
    </item>
    <item>
      <title>From Messy Data to Confident Decisions: How Analysts Use Power BI, DAX, and Dashboards in the Real World</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 10 Feb 2026 08:48:51 +0000</pubDate>
      <link>https://dev.to/john_analytics/from-messy-data-to-confident-decisions-how-analysts-use-power-bi-dax-and-dashboards-in-the-real-4860</link>
      <guid>https://dev.to/john_analytics/from-messy-data-to-confident-decisions-how-analysts-use-power-bi-dax-and-dashboards-in-the-real-4860</guid>
      <description>&lt;p&gt;Power BI skills are often misunderstood as “just reporting.” In reality, professional analysts use Power BI as a &lt;strong&gt;decision-support system&lt;/strong&gt; — one that transforms messy, unreliable data into insights leaders trust to allocate budgets, adjust strategy, and measure performance.&lt;/p&gt;

&lt;p&gt;This article demonstrates how &lt;strong&gt;technical Power BI skills translate directly into real-world business decisions and measurable impact&lt;/strong&gt;, following the same workflow used in real organizations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Messy Data Is a Business Risk, Not a Technical Issue
&lt;/h2&gt;

&lt;p&gt;In real organizations, data arrives incomplete and inconsistent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regions spelled differently across systems&lt;/li&gt;
&lt;li&gt;Missing transaction dates&lt;/li&gt;
&lt;li&gt;Revenue stored as text&lt;/li&gt;
&lt;li&gt;Duplicate customer records&lt;/li&gt;
&lt;li&gt;Placeholder values like &lt;code&gt;N/A&lt;/code&gt; and &lt;code&gt;Error&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When these issues are ignored, dashboards show incorrect KPIs and misleading trends.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;p&gt;When analysts clean data correctly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Financial metrics become trustworthy&lt;/li&gt;
&lt;li&gt;Performance comparisons are accurate&lt;/li&gt;
&lt;li&gt;Leaders focus on decisions instead of debating numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why analysts begin in &lt;strong&gt;Power Query&lt;/strong&gt;, not visuals.&lt;/p&gt;




&lt;h2&gt;
  
  
  Power Query: Turning Raw Inputs into Reliable Data
&lt;/h2&gt;

&lt;p&gt;Power Query is where analysts reduce business risk.&lt;/p&gt;

&lt;p&gt;Using repeatable transformation steps, analysts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardize categories for consistent grouping&lt;/li&gt;
&lt;li&gt;Remove invalid or duplicate records&lt;/li&gt;
&lt;li&gt;Apply correct data types for calculations and time analysis&lt;/li&gt;
&lt;li&gt;Replace pseudo-blanks with true null values&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Outcome
&lt;/h3&gt;

&lt;p&gt;After proper Power Query transformations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly revenue no longer fluctuates unexpectedly&lt;/li&gt;
&lt;li&gt;Forecasts align with finance systems&lt;/li&gt;
&lt;li&gt;Data refreshes produce consistent results automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reliable data is the foundation of every decision.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Modeling: Structuring Data for Decision-Making
&lt;/h2&gt;

&lt;p&gt;Power BI does not analyze spreadsheets — it analyzes &lt;strong&gt;data models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Professional analysts design &lt;strong&gt;star schemas&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact tables&lt;/strong&gt; store measurable business events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension tables&lt;/strong&gt; provide descriptive context&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;Well-designed models ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable filter behavior&lt;/li&gt;
&lt;li&gt;Accurate KPIs across dashboards&lt;/li&gt;
&lt;li&gt;Strong performance as data volumes grow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Poor modeling leads to conflicting answers and erodes stakeholder trust.&lt;/p&gt;




&lt;h2&gt;
  
  
  DAX: Translating Business Questions into Logic
&lt;/h2&gt;

&lt;p&gt;DAX allows analysts to express business logic directly in calculations.&lt;/p&gt;

&lt;p&gt;Executives ask questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are we improving compared to last year?&lt;/li&gt;
&lt;li&gt;Which regions are underperforming?&lt;/li&gt;
&lt;li&gt;How close are we to our targets?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using DAX, analysts move beyond raw totals to meaningful metrics like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profit margins&lt;/li&gt;
&lt;li&gt;Year-over-year growth&lt;/li&gt;
&lt;li&gt;Year-to-date performance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Business Impact
&lt;/h3&gt;

&lt;p&gt;DAX enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fair comparisons across time&lt;/li&gt;
&lt;li&gt;KPI tracking against targets&lt;/li&gt;
&lt;li&gt;Scenario-based decision-making&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Without DAX, dashboards show numbers.  With DAX, they show &lt;strong&gt;meaning&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Time Intelligence: Supporting Strategic Decisions
&lt;/h2&gt;

&lt;p&gt;Time intelligence allows organizations to understand performance trends.&lt;/p&gt;

&lt;p&gt;Using time-based analysis, analysts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compare current results to prior periods&lt;/li&gt;
&lt;li&gt;Detect early signs of growth or decline&lt;/li&gt;
&lt;li&gt;Measure progress toward annual goals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Decisions Enabled
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Expanding high-growth regions&lt;/li&gt;
&lt;li&gt;Addressing seasonal declines proactively&lt;/li&gt;
&lt;li&gt;Adjusting forecasts based on YTD performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Time intelligence transforms historical data into &lt;strong&gt;forward-looking insight&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dashboards: From Information to Action
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Dashboards are not collections of charts.
They are &lt;strong&gt;decision interfaces&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Effective dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highlight critical KPIs&lt;/li&gt;
&lt;li&gt;Show trends requiring attention&lt;/li&gt;
&lt;li&gt;Surface underperformance and exceptions&lt;/li&gt;
&lt;li&gt;Enable fast filtering without technical effort&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Measurable Outcomes
&lt;/h3&gt;

&lt;p&gt;Well-designed dashboards help organizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce time spent validating numbers&lt;/li&gt;
&lt;li&gt;Detect issues earlier&lt;/li&gt;
&lt;li&gt;Align teams around shared metrics&lt;/li&gt;
&lt;li&gt;Act faster with confidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dashboards succeed when users know &lt;strong&gt;what to do next&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring Success: Business Impact of Power BI
&lt;/h2&gt;

&lt;p&gt;When Power BI is used effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leaders trust the data without manual validation&lt;/li&gt;
&lt;li&gt;Decisions are backed by consistent metrics&lt;/li&gt;
&lt;li&gt;Reporting effort decreases&lt;/li&gt;
&lt;li&gt;Performance improvements are measurable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The real value of Power BI is not the report —&lt;br&gt;&lt;br&gt;
it is the &lt;strong&gt;decisions enabled by the report&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Power BI as a Strategic Asset
&lt;/h2&gt;

&lt;p&gt;Analysts create measurable impact by combining:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Power Query for data reliability
&lt;/li&gt;
&lt;li&gt;Data modeling for meaningful analysis
&lt;/li&gt;
&lt;li&gt;DAX for business logic
&lt;/li&gt;
&lt;li&gt;Dashboards designed for action
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is how messy data becomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trusted insights&lt;/li&gt;
&lt;li&gt;Confident decisions&lt;/li&gt;
&lt;li&gt;Real business outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Power BI, when used professionally, is not a reporting tool —  it is a &lt;strong&gt;strategic decision-making engine&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




</description>
      <category>powerbi</category>
      <category>data</category>
      <category>analytics</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>📊 Understanding Schemas and Data Modelling in Power BI</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Thu, 05 Feb 2026 11:13:12 +0000</pubDate>
      <link>https://dev.to/john_analytics/understanding-schemas-and-data-modelling-in-power-bi-2epd</link>
      <guid>https://dev.to/john_analytics/understanding-schemas-and-data-modelling-in-power-bi-2epd</guid>
      <description>&lt;p&gt;Data modelling is the foundation of building scalable and&lt;br&gt;
high-performance dashboards in Power BI. While many developers focus&lt;br&gt;
heavily on visuals and DAX calculations, the true performance and&lt;br&gt;
accuracy of a report depend heavily on how data is structured.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 What Is Data Modelling in Power BI?
&lt;/h2&gt;

&lt;p&gt;Data modelling refers to structuring data into logical formats that&lt;br&gt;
support analysis and reporting. Dimensional modelling organizes data&lt;br&gt;
into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Facts (measurable metrics)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dimensions (descriptive attributes)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A well-designed Power BI model determines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  How tables relate&lt;/li&gt;
&lt;li&gt;  How filters propagate&lt;/li&gt;
&lt;li&gt;  How fast reports load&lt;/li&gt;
&lt;li&gt;  How accurate calculations are&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⭐ Star Schema
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ What Is Star Schema?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Star Schema&lt;/strong&gt; organizes data using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  One &lt;strong&gt;central fact table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Multiple &lt;strong&gt;dimension tables&lt;/strong&gt; connected directly to the fact table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The structure resembles a star, where dimension tables surround the&lt;br&gt;
central fact table.&lt;/p&gt;




&lt;h3&gt;
  
  
  📊 Components of Star Schema
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Fact Table
&lt;/h4&gt;

&lt;p&gt;Contains measurable and quantitative data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Sales revenue&lt;/li&gt;
&lt;li&gt;  Quantity sold&lt;/li&gt;
&lt;li&gt;  Profit&lt;/li&gt;
&lt;li&gt;  Discount&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each row represents a business event like a transaction.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dimension Tables
&lt;/h4&gt;

&lt;p&gt;Provide descriptive context to fact data such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Customer details&lt;/li&gt;
&lt;li&gt;  Product attributes&lt;/li&gt;
&lt;li&gt;  Date/Time&lt;/li&gt;
&lt;li&gt;  Store location&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🚀 Benefits of Star Schema
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  ✔ High query performance&lt;/li&gt;
&lt;li&gt;  ✔ Simple design&lt;/li&gt;
&lt;li&gt;  ✔ Easier DAX calculations&lt;/li&gt;
&lt;li&gt;  ✔ Optimized for reporting and dashboards&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Data redundancy&lt;/li&gt;
&lt;li&gt;  Higher storage usage&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ❄️ Snowflake Schema
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ What Is Snowflake Schema?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;Snowflake Schema&lt;/strong&gt; extends star schema by normalizing dimension&lt;br&gt;
tables into multiple related tables.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Customer → City → Country&lt;/p&gt;




&lt;h3&gt;
  
  
  📌 Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Normalized dimension tables&lt;/li&gt;
&lt;li&gt;  Supports hierarchical drill-down analysis&lt;/li&gt;
&lt;li&gt;  Improves data integrity&lt;/li&gt;
&lt;li&gt;  Requires additional joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚖️ Advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  ✔ Reduced redundancy&lt;/li&gt;
&lt;li&gt;  ✔ Better data consistency&lt;/li&gt;
&lt;li&gt;  ✔ Supports complex hierarchies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚠️ Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  More complex design&lt;/li&gt;
&lt;li&gt;  Slower performance due to joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📊 Fact Tables vs Dimension Tables
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📊 Fact Tables
&lt;/h3&gt;

&lt;p&gt;Store numeric metrics and foreign keys linking to dimensions.&lt;br&gt;
Usually contain transactional data and large volumes of records.&lt;/p&gt;

&lt;h3&gt;
  
  
  🧾 Dimension Tables
&lt;/h3&gt;

&lt;p&gt;Store descriptive attributes that provide context to fact tables.&lt;br&gt;
Used for filtering, grouping, and categorizing data.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔗 Relationships in Power BI
&lt;/h2&gt;

&lt;p&gt;Relationships connect tables and enable filtering across datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  📌 Types of Relationships
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;One-to-Many&lt;/strong&gt; -- One dimension record links to many fact records&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Many-to-One&lt;/strong&gt; -- Reverse of one-to-many&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Many-to-Many&lt;/strong&gt; -- Multiple matching records on both sides&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🚨 Why Good Data Modelling Is Critical
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ⚡ Performance Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Improves query speed&lt;/li&gt;
&lt;li&gt;  Reduces memory usage&lt;/li&gt;
&lt;li&gt;  Enables faster dashboard loading&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎯 Accurate Reporting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Ensures correct aggregations&lt;/li&gt;
&lt;li&gt;  Maintains reliable filter behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧠 Easier DAX Calculations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Simplifies analytical queries&lt;/li&gt;
&lt;li&gt;  Improves calculation accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🔧 Scalability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Supports future data expansion&lt;/li&gt;
&lt;li&gt;  Easier troubleshooting and maintenance&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⭐ Star Schema vs Snowflake Schema
&lt;/h2&gt;

&lt;p&gt;Feature       Star Schema            Snowflake Schema&lt;/p&gt;




&lt;p&gt;Structure     Denormalized           Normalized&lt;br&gt;
  Performance   Faster queries         Slower queries&lt;br&gt;
  Complexity    Simple                 Complex&lt;br&gt;
  Storage       Higher storage         Lower storage&lt;br&gt;
  Use Case      Reporting dashboards   Large complex warehouses&lt;/p&gt;




&lt;h2&gt;
  
  
  🏆 Power BI Data Modelling Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Use Star Schema whenever possible&lt;/li&gt;
&lt;li&gt;  Separate fact and dimension tables&lt;/li&gt;
&lt;li&gt;  Maintain clear relationships&lt;/li&gt;
&lt;li&gt;  Optimize data types&lt;/li&gt;
&lt;li&gt;  Reduce unnecessary joins&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📌 Real-World Analogy
&lt;/h2&gt;

&lt;p&gt;Think of a library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Fact tables&lt;/strong&gt; = Books (contain measurable information)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Dimension tables&lt;/strong&gt; = Catalogue system (organizes and locates
books)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Conclusion
&lt;/h2&gt;

&lt;p&gt;Data modelling is one of the most important skills for Power BI&lt;br&gt;
developers. Understanding schema design ensures dashboards are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Fast&lt;/li&gt;
&lt;li&gt;  Accurate&lt;/li&gt;
&lt;li&gt;  Scalable&lt;/li&gt;
&lt;li&gt;  Maintainable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before building visuals or writing DAX formulas, always ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Is my data model structured correctly?&lt;/p&gt;
&lt;/blockquote&gt;




</description>
      <category>datamodelling</category>
      <category>powerbi</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Getting Started With Linux for Data Engineers (With Vi and Nano Examples)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Thu, 05 Feb 2026 10:55:10 +0000</pubDate>
      <link>https://dev.to/john_analytics/getting-started-with-linux-for-data-engineers-with-vi-and-nano-examples-4m99</link>
      <guid>https://dev.to/john_analytics/getting-started-with-linux-for-data-engineers-with-vi-and-nano-examples-4m99</guid>
      <description>&lt;p&gt;If you're getting into data engineering, there's one skill that keeps&lt;br&gt;
showing up everywhere: &lt;strong&gt;Linux&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Whether you're working with cloud servers, big data tools, or pipeline&lt;br&gt;
automation, Linux is almost always running behind the scenes. The good&lt;br&gt;
news? You don't need to be a Linux wizard to get started.&lt;/p&gt;

&lt;p&gt;In this guide, we'll break down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Why data engineers need Linux&lt;/li&gt;
&lt;li&gt;  Basic commands you'll actually use&lt;/li&gt;
&lt;li&gt;  How to edit files using Vi&lt;/li&gt;
&lt;li&gt;  How to edit files using Nano&lt;/li&gt;
&lt;li&gt;  Real-world examples&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why Should Data Engineers Learn Linux?
&lt;/h2&gt;

&lt;p&gt;Here's the honest truth --- most production data systems run on Linux&lt;br&gt;
servers.&lt;/p&gt;

&lt;p&gt;When you deploy Spark jobs, schedule Airflow pipelines, or manage&lt;br&gt;
databases, you'll likely connect to a Linux machine.&lt;/p&gt;
&lt;h3&gt;
  
  
  It's Built for Performance
&lt;/h3&gt;

&lt;p&gt;Linux handles heavy workloads really well, which is perfect for big data&lt;br&gt;
processing.&lt;/p&gt;
&lt;h3&gt;
  
  
  It's Highly Customizable
&lt;/h3&gt;

&lt;p&gt;Since Linux is open source, companies tailor it for their&lt;br&gt;
infrastructure.&lt;/p&gt;
&lt;h3&gt;
  
  
  It Runs the Cloud
&lt;/h3&gt;

&lt;p&gt;Most AWS, Azure, and Google Cloud servers run Linux.&lt;/p&gt;
&lt;h3&gt;
  
  
  It Supports Automation
&lt;/h3&gt;

&lt;p&gt;Data engineers constantly automate workflows using shell scripts.&lt;/p&gt;


&lt;h2&gt;
  
  
  Linux Commands Every Beginner Should Know
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Check Where You Are
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;pwd&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  List Files
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Move Between Folders
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;folder_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Create a Folder
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;data_project
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Create a File
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;notes.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Read a File
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;notes.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Text Editors Matter in Linux
&lt;/h2&gt;

&lt;p&gt;When you log into a server, there's usually no graphical editor like VS&lt;br&gt;
Code or Notepad.&lt;/p&gt;

&lt;p&gt;Instead, you use terminal editors like: - Vi (powerful but tricky) -&lt;br&gt;
Nano (simple and beginner-friendly)&lt;/p&gt;


&lt;h2&gt;
  
  
  Using Vi (The Power Tool)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Open or Create a File
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi sample.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Enter Insert Mode
&lt;/h3&gt;

&lt;p&gt;Press &lt;code&gt;i&lt;/code&gt; and start typing.&lt;/p&gt;
&lt;h3&gt;
  
  
  Save and Exit
&lt;/h3&gt;

&lt;p&gt;Press &lt;code&gt;ESC&lt;/code&gt;, then type:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:wq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Exit Without Saving
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;:q!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example Script
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi pipeline.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Pipeline started"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Using Nano (The Friendly Editor)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open a File
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano notes.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Save Your Work
&lt;/h3&gt;

&lt;p&gt;Press:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CTRL + O
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Exit Nano
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CTRL + X
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Example Config
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano config.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database=postgres
username=admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Real-Life Data Engineering Scenario
&lt;/h2&gt;

&lt;p&gt;You may need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Update Airflow configuration&lt;/li&gt;
&lt;li&gt;  Fix pipeline scripts&lt;/li&gt;
&lt;li&gt;  Modify database credentials&lt;/li&gt;
&lt;li&gt;  Check logs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Commands might include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano airflow.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi pipeline.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Pro Tips for Beginners
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Always back up files before editing&lt;/li&gt;
&lt;li&gt;  Practice Vi commands slowly&lt;/li&gt;
&lt;li&gt;  Use Nano when learning&lt;/li&gt;
&lt;li&gt;  Learn basic shell commands daily&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Linux is part of the foundation of modern data infrastructure.&lt;/p&gt;

&lt;p&gt;Learning Linux commands and text editors gives you confidence when&lt;br&gt;
working with production servers and cloud platforms.&lt;/p&gt;

&lt;p&gt;Start with Nano.&lt;br&gt;
Grow into Vi.&lt;br&gt;
Practice consistently.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>dataengineering</category>
      <category>tutorial</category>
      <category>data</category>
    </item>
    <item>
      <title>Real-World ETL Pipeline from a Public Google Sheet</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Wed, 04 Feb 2026 08:52:02 +0000</pubDate>
      <link>https://dev.to/john_analytics/real-world-etl-pipeline-from-a-public-google-sheet-7cn</link>
      <guid>https://dev.to/john_analytics/real-world-etl-pipeline-from-a-public-google-sheet-7cn</guid>
      <description>&lt;p&gt;Spreadsheets are everywhere.&lt;/p&gt;

&lt;p&gt;They’re easy to use, easy to share, and often become the &lt;strong&gt;first home of business data&lt;/strong&gt;. But they’re terrible for analytics, automation, and scale.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk through how I built a &lt;strong&gt;production-style ETL pipeline&lt;/strong&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extracts data from a &lt;strong&gt;public Google Sheet&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Cleans and validates the data using Python&lt;/li&gt;
&lt;li&gt;Loads the data into &lt;strong&gt;PostgreSQL&lt;/strong&gt; and &lt;strong&gt;MongoDB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Handles real-world issues like UUIDs, connection strings, and performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A supermarket dataset was stored in a Google Sheet. While this works for manual inspection, it introduces several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No schema enforcement&lt;/li&gt;
&lt;li&gt;No support for analytics or BI tools&lt;/li&gt;
&lt;li&gt;Poor performance for large queries&lt;/li&gt;
&lt;li&gt;No safe way to integrate with applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was to move this data into proper databases while following &lt;strong&gt;real ETL best practices&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Public Google Sheet (CSV Export)
        ↓
     Python ETL
        ↓
  Transform &amp;amp; Validate
        ↓
 PostgreSQL (Analytics)   MongoDB (Documents)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why two databases?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; acts as the system of record for analytics and reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; provides flexible, document-based storage for application access&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tools Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UV&lt;/strong&gt; for dependency and environment management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; for data transformation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests&lt;/strong&gt; for HTTP-based extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; (via SQLAlchemy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt; (via PyMongo)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loguru&lt;/strong&gt; for structured logging&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Extracting Data from a Public Google Sheet
&lt;/h2&gt;

&lt;p&gt;Instead of dealing with Google Cloud authentication, I used a simpler (and very realistic) approach.&lt;/p&gt;

&lt;p&gt;Google Sheets exposes a &lt;strong&gt;CSV export endpoint&lt;/strong&gt; for public sheets.&lt;/p&gt;

&lt;p&gt;A human-friendly link like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://docs.google.com/spreadsheets/d/&amp;lt;sheet-id&amp;gt;/edit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can be converted to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://docs.google.com/spreadsheets/d/&amp;lt;sheet-id&amp;gt;/export?format=csv&amp;amp;gid=&amp;lt;gid&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Python can then fetch the data directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys. No OAuth. Fully automated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Transforming the Data (Where Things Get Real)
&lt;/h2&gt;

&lt;p&gt;This is where assumptions break.&lt;/p&gt;

&lt;p&gt;I initially assumed the &lt;code&gt;id&lt;/code&gt; column was numeric. It wasn’t.&lt;/p&gt;

&lt;p&gt;It contained &lt;strong&gt;UUIDs&lt;/strong&gt; like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;47d54138-a950-4ec0-9d4a-e637e8dfb290
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trying to cast this to an integer caused the pipeline to fail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson #1: The data always wins
&lt;/h3&gt;

&lt;p&gt;The fix was simple but important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat &lt;code&gt;id&lt;/code&gt; as a &lt;strong&gt;string&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Update both transformation logic and database schemas&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 3: Loading into PostgreSQL
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is the backbone of the pipeline.&lt;/p&gt;

&lt;p&gt;Key design decisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong schema enforcement&lt;/li&gt;
&lt;li&gt;Idempotent inserts&lt;/li&gt;
&lt;li&gt;Safe re-runs of the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table is created if it doesn’t exist, and inserts use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No duplicate records&lt;/li&gt;
&lt;li&gt;No need to truncate tables&lt;/li&gt;
&lt;li&gt;Safe incremental runs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 4: Loading into MongoDB (and Fixing Performance)
&lt;/h2&gt;

&lt;p&gt;My first MongoDB implementation used &lt;code&gt;update_one()&lt;/code&gt; in a loop.&lt;/p&gt;

&lt;p&gt;It worked — but it was &lt;strong&gt;painfully slow&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The fix was switching to &lt;strong&gt;bulk operations&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk_write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reduced load time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Debugging a Nasty PostgreSQL Error
&lt;/h2&gt;

&lt;p&gt;One of the most confusing errors I hit was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;could not translate host name "4401@localhost"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turned out the PostgreSQL password contained an &lt;code&gt;@&lt;/code&gt; symbol.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson #3: Database passwords must be URL-safe
&lt;/h3&gt;

&lt;p&gt;The solution was &lt;strong&gt;URL encoding&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KIM@4401 → KIM%404401
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After fixing these issues, the pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs end-to-end.&lt;/li&gt;
&lt;li&gt;Can be safely re-run without duplicates&lt;/li&gt;
&lt;li&gt;Loads clean data into both databases&lt;/li&gt;
&lt;li&gt;Handles real-world data quirks correctly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Spreadsheets are common sources — but not suitable destinations&lt;/li&gt;
&lt;li&gt;Never assume data types without inspecting real data&lt;/li&gt;
&lt;li&gt;UUIDs are extremely common in production systems&lt;/li&gt;
&lt;li&gt;Bulk operations matter for performance&lt;/li&gt;
&lt;li&gt;Environment variables and connection strings are frequent failure points&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project wasn’t about flashy tools.&lt;/p&gt;

&lt;p&gt;It was about building something &lt;strong&gt;real&lt;/strong&gt;, breaking it, and fixing it.&lt;/p&gt;




</description>
      <category>database</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a MedAdvantage RAF Engine with dbt &amp; PostgreSQL (Step-by-Step Guide)</title>
      <dc:creator>John Wakaba</dc:creator>
      <pubDate>Tue, 13 Jan 2026 10:09:54 +0000</pubDate>
      <link>https://dev.to/john_analytics/building-a-medadvantage-raf-engine-with-dbt-postgresql-step-by-step-guide-5fed</link>
      <guid>https://dev.to/john_analytics/building-a-medadvantage-raf-engine-with-dbt-postgresql-step-by-step-guide-5fed</guid>
      <description>&lt;p&gt;In this project, I built a mini &lt;strong&gt;Medicare Advantage Risk Adjustment Factor (RAF) engine&lt;/strong&gt; using &lt;strong&gt;PostgreSQL&lt;/strong&gt;, &lt;strong&gt;dbt Core&lt;/strong&gt;, and &lt;strong&gt;synthetic healthcare data&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
The goal was to simulate a real-world healthcare analytics pipeline that transforms &lt;strong&gt;raw claims data into member-level risk scores&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This article walks through the &lt;strong&gt;entire process step by step&lt;/strong&gt;, from raw CSV files to a final analytics mart.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Project Overview
&lt;/h2&gt;

&lt;p&gt;The project models how healthcare organizations calculate &lt;strong&gt;risk scores&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Member demographic data
&lt;/li&gt;
&lt;li&gt;Medical claims with &lt;strong&gt;ICD-10 diagnosis codes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Pharmacy claims with &lt;strong&gt;NDC drug codes&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reference mapping tables that convert codes into &lt;strong&gt;HCC and RxHCC categories&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The final output is a table that shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Member ID
&lt;/li&gt;
&lt;li&gt;Service year
&lt;/li&gt;
&lt;li&gt;Gender and plan
&lt;/li&gt;
&lt;li&gt;Total HCC weight
&lt;/li&gt;
&lt;li&gt;Total RxHCC weight
&lt;/li&gt;
&lt;li&gt;Final risk score
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This type of table is commonly used for &lt;strong&gt;actuarial analysis, reimbursement modeling, and population health analytics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tools &amp;amp; Environment Setup
&lt;/h2&gt;

&lt;p&gt;Before starting the project, I installed and configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; as the data warehouse
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DBeaver&lt;/strong&gt; as the database client
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python + virtual environment (venv)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt Core with the Postgres adapter&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup allows a modern &lt;strong&gt;ELT (Extract → Load → Transform)&lt;/strong&gt; workflow where data is first loaded into the warehouse and then transformed using dbt.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Database Design
&lt;/h2&gt;

&lt;p&gt;I used one PostgreSQL database with &lt;strong&gt;two schemas&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;med_project&lt;/code&gt;&lt;/strong&gt; → holds raw data and reference tables
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;analytics&lt;/code&gt;&lt;/strong&gt; → holds all dbt models (staging, core, and marts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Raw tables stored in &lt;code&gt;med_project&lt;/code&gt;:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Members
&lt;/li&gt;
&lt;li&gt;Medical claims
&lt;/li&gt;
&lt;li&gt;Pharmacy claims
&lt;/li&gt;
&lt;li&gt;ICD-to-HCC mapping table
&lt;/li&gt;
&lt;li&gt;NDC-to-RxHCC mapping table
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  dbt models stored in &lt;code&gt;analytics&lt;/code&gt;:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Staging models
&lt;/li&gt;
&lt;li&gt;Core models
&lt;/li&gt;
&lt;li&gt;Mart models (final analytics tables)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation keeps &lt;strong&gt;raw data immutable&lt;/strong&gt; and &lt;strong&gt;transformed data fully governed by dbt&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Loading the Raw Data
&lt;/h2&gt;

&lt;p&gt;Synthetic CSV files were generated and loaded into PostgreSQL using &lt;strong&gt;DBeaver’s Import Tool&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Each CSV corresponded to one raw table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Members
&lt;/li&gt;
&lt;li&gt;Medical claims
&lt;/li&gt;
&lt;li&gt;Pharmacy claims
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All raw table columns were stored as &lt;strong&gt;TEXT&lt;/strong&gt; initially. This prevents ingestion failures during loading and allows all data type enforcement to be handled inside dbt.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Initializing the dbt Project
&lt;/h2&gt;

&lt;p&gt;A new dbt project was created inside a Python virtual environment.&lt;br&gt;&lt;br&gt;
During initialization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL was selected as the adapter
&lt;/li&gt;
&lt;li&gt;A profile was created in &lt;code&gt;profiles.yml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The default target schema was set to &lt;code&gt;analytics&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dbt connection was validated to ensure connectivity between dbt and PostgreSQL.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Registering Raw Tables as dbt Sources
&lt;/h2&gt;

&lt;p&gt;To allow dbt to reference raw tables safely, a &lt;strong&gt;source configuration file&lt;/strong&gt; was created.&lt;/p&gt;

&lt;p&gt;This file tells dbt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which schema the raw tables are in
&lt;/li&gt;
&lt;li&gt;Which tables are considered &lt;strong&gt;authoritative raw sources&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Which tables act as &lt;strong&gt;reference data (HCC &amp;amp; RxHCC mappings)&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables consistent use of dbt sources and prevents hard-coding schema names inside models.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Building the Staging Layer
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;staging layer&lt;/strong&gt; is where all raw data is cleaned and standardized.&lt;/p&gt;

&lt;p&gt;At this stage, I performed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explicit parsing of U.S.-formatted dates
&lt;/li&gt;
&lt;li&gt;Numeric type conversions for amounts and quantities
&lt;/li&gt;
&lt;li&gt;Trimming and upper-casing of ICD and NDC codes
&lt;/li&gt;
&lt;li&gt;Basic null handling
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three staging models were created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;stg_members&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stg_medical_claims&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stg_pharmacy_claims&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models ensure that all downstream data has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistent data types
&lt;/li&gt;
&lt;li&gt;Clean formats
&lt;/li&gt;
&lt;li&gt;Reliable date values
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Building the Core Layer
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;core layer&lt;/strong&gt; represents &lt;strong&gt;analytics-ready entities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here I focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating one clean row per member
&lt;/li&gt;
&lt;li&gt;Passing through only validated claim and pharmacy records
&lt;/li&gt;
&lt;li&gt;Preparing the data for aggregation
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core models included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clean member dimension
&lt;/li&gt;
&lt;li&gt;Medical claim fact table
&lt;/li&gt;
&lt;li&gt;Pharmacy claim fact table
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer removes duplication and creates stable structures used by the marts.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Mapping Diagnoses to HCCs
&lt;/h2&gt;

&lt;p&gt;Medical claims were mapped to &lt;strong&gt;HCC categories&lt;/strong&gt; using the ICD-to-HCC reference table.&lt;/p&gt;

&lt;p&gt;At this stage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each member’s diagnosis codes were expanded
&lt;/li&gt;
&lt;li&gt;Codes were normalized
&lt;/li&gt;
&lt;li&gt;Valid diagnoses were matched to HCC categories
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output produces one record per &lt;strong&gt;member per year per HCC category&lt;/strong&gt;, with an associated &lt;strong&gt;risk weight&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Computing Final Member Risk Scores
&lt;/h2&gt;

&lt;p&gt;The final step combined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregated &lt;strong&gt;HCC weights&lt;/strong&gt; from medical claims
&lt;/li&gt;
&lt;li&gt;Aggregated &lt;strong&gt;RxHCC weights&lt;/strong&gt; from pharmacy claims
&lt;/li&gt;
&lt;li&gt;Member demographic attributes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each member and service year:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HCC weights were summed
&lt;/li&gt;
&lt;li&gt;RxHCC weights were summed
&lt;/li&gt;
&lt;li&gt;A base score was added
&lt;/li&gt;
&lt;li&gt;A final &lt;strong&gt;risk score&lt;/strong&gt; was computed
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This produced the final analytics mart: &lt;strong&gt;&lt;code&gt;member_risk_scores&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Results &amp;amp; Key Outcomes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A fully functional &lt;strong&gt;end-to-end healthcare analytics pipeline&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Clean transformation workflow from raw CSVs to final mart&lt;/li&gt;
&lt;li&gt;A production-style &lt;strong&gt;member-level risk score table&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data ready for reporting tools like Power BI or Tableau&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;dbt enforces strong modeling discipline through layered architecture
&lt;/li&gt;
&lt;li&gt;Reference mapping tables are the backbone of healthcare risk analytics
&lt;/li&gt;
&lt;li&gt;Explicit type casting prevents silent data quality issues
&lt;/li&gt;
&lt;li&gt;Separation of raw, staging, core, and marts ensures scalability and auditability
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  12. Key Challenges Faced &amp;amp; Resolved
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL CSV permission issues → Resolved by using client-side imports
&lt;/li&gt;
&lt;li&gt;Cross-database reference errors in dbt → Fixed by aligning the dbt database configuration
&lt;/li&gt;
&lt;li&gt;U.S. date format parsing errors → Solved by explicitly controlling date parsing in staging
&lt;/li&gt;
&lt;li&gt;Semicolon syntax errors in dbt models → Resolved by removing trailing semicolons
&lt;/li&gt;
&lt;li&gt;Source configuration mismatches → Fixed by correcting schema references
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how modern analytics engineering tools can simulate &lt;strong&gt;real Medicare Advantage risk modeling workflows&lt;/strong&gt; using open-source technologies.  &lt;/p&gt;

</description>
      <category>analytics</category>
      <category>dataengineering</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
