Neeraj Kumar
Day 2: Advanced SQL Preparation Guide

Objectives

  • Select and download suitable datasets
  • Set up PostgreSQL environment
  • Practice basic queries as a foundation
  • Prepare for CTEs, window functions, and query optimization

Part 1: Dataset Selection

Recommended Datasets

Dataset 1: E-commerce Sales Data (Primary)

Source: Kaggle - "Superstore Sales Dataset"

Key columns: Order ID, Customer ID, Product, Category, Sales, Profit, Quantity, Order Date, Ship Date

Advanced SQL use cases:

  • Running totals and moving averages (window functions)
  • Customer cohort analysis (CTEs)
  • Sales ranking by region/category (RANK, DENSE_RANK)
  • Year-over-year comparisons
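You can preview the running-total idea before PostgreSQL is even installed: SQLite (3.25+) supports the same `OVER` clause, so a rough sketch with Python's bundled `sqlite3` and made-up toy data looks like this (table name and values are illustrative only):

```python
import sqlite3

# Toy sales table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2017-01-01", 100.0), ("2017-01-02", 50.0), ("2017-01-03", 25.0)],
)

# Running total: the same shape as SUM(sales) OVER (ORDER BY order_date)
rows = conn.execute(
    """
    SELECT order_date,
           amount,
           SUM(amount) OVER (ORDER BY order_date) AS running_total
    FROM sales
    ORDER BY order_date
    """
).fetchall()
print(rows)
```

The same query text runs unchanged in PostgreSQL against the real dataset.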

Dataset 2: Employee/HR Data (Secondary)

Source: Kaggle - "HR Analytics Dataset"

Advanced SQL use cases:

  • Employee tenure analysis
  • Salary percentiles by department
  • Retention rate calculations
  • Manager-employee relationships
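Manager-employee relationships are usually modeled as a self-referencing `manager_id` column and queried with a self-join. A minimal sketch, assuming a hypothetical `employees` table of that shape (names here are invented):

```python
import sqlite3

# Hypothetical employees table with a self-referencing manager_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", None), (2, "Ben", 1), (3, "Cara", 1)],
)

# Self-join: pair each employee with their manager's name.
# LEFT JOIN keeps the top-level employee whose manager_id is NULL.
pairs = conn.execute(
    """
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.id
    ORDER BY e.id
    """
).fetchall()
print(pairs)
```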

Alternative Dataset Options:

Option A: COVID-19 Data

Option B: Airbnb Listings


Part 2: PostgreSQL Setup

Installation

On macOS:

# Using Homebrew
brew install postgresql@15

# Start PostgreSQL service
brew services start postgresql@15

# Verify installation
psql --version

On Ubuntu/Debian:

# Update package list
sudo apt update

# Install PostgreSQL
sudo apt install postgresql postgresql-contrib

# Start service
sudo systemctl start postgresql
sudo systemctl enable postgresql

# Switch to postgres user
sudo -u postgres psql

On Windows:

  1. Download installer from: https://www.postgresql.org/download/windows/
  2. Run the installer (includes pgAdmin 4)
  3. Remember your superuser password
  4. Add PostgreSQL bin to PATH

Initial Configuration

-- Connect to PostgreSQL
psql -U postgres

-- Create a database for practice
CREATE DATABASE data_engineering_practice;

-- Connect to the new database
\c data_engineering_practice

-- Create a user (optional)
CREATE USER data_engineer WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE data_engineering_practice TO data_engineer;

Useful PostgreSQL Commands

-- List all databases
\l

-- List all tables in current database
\dt

-- Describe table structure
\d table_name

-- List all schemas
\dn

-- Quit psql
\q

-- Execute SQL from file
\i /path/to/file.sql

-- Show query execution time
\timing on

Part 3: Loading Your Data

Method 1: Using COPY (Fastest for CSV)

-- Create table structure first
CREATE TABLE superstore_sales (
    row_id SERIAL PRIMARY KEY,
    order_id VARCHAR(50),
    order_date DATE,
    ship_date DATE,
    ship_mode VARCHAR(50),
    customer_id VARCHAR(50),
    customer_name VARCHAR(100),
    segment VARCHAR(50),
    country VARCHAR(50),
    city VARCHAR(100),
    state VARCHAR(50),
    postal_code VARCHAR(20),
    region VARCHAR(50),
    product_id VARCHAR(50),
    category VARCHAR(50),
    sub_category VARCHAR(50),
    product_name VARCHAR(255),
    sales DECIMAL(10,2),
    quantity INTEGER,
    discount DECIMAL(4,2),
    profit DECIMAL(10,2)
);

-- Load CSV data
COPY superstore_sales(order_id, order_date, ship_date, ship_mode, 
                      customer_id, customer_name, segment, country, 
                      city, state, postal_code, region, product_id, 
                      category, sub_category, product_name, sales, 
                      quantity, discount, profit)
FROM '/path/to/superstore.csv'
DELIMITER ','
CSV HEADER;

Method 2: Using pgAdmin GUI

  1. Right-click on table → Import/Export
  2. Select file and format
  3. Map columns
  4. Import

Method 3: Using Python (for data cleaning)

import pandas as pd
from sqlalchemy import create_engine

# Read and clean data
df = pd.read_csv('superstore.csv')

# Connect to PostgreSQL
engine = create_engine('postgresql://username:password@localhost:5432/data_engineering_practice')

# Load to database
# NOTE: if_exists='replace' drops and recreates the table, discarding
# the column types defined earlier; use if_exists='append' to keep them
df.to_sql('superstore_sales', engine, if_exists='replace', index=False)

Part 4: Basic SQL Practice Queries

Query Set 1: Basic SELECT and Filtering

-- 1. View first 10 rows
SELECT * FROM superstore_sales LIMIT 10;

-- 2. Count total orders
SELECT COUNT(DISTINCT order_id) as total_orders 
FROM superstore_sales;

-- 3. Filter by category
SELECT * FROM superstore_sales 
WHERE category = 'Technology' 
LIMIT 20;

-- 4. Multiple conditions
SELECT customer_name, sales, profit 
FROM superstore_sales 
WHERE region = 'West' 
  AND sales > 1000 
  AND order_date >= '2017-01-01';

-- 5. Pattern matching
SELECT DISTINCT product_name 
FROM superstore_sales 
WHERE product_name LIKE '%Chair%';

Query Set 2: Aggregations

-- 6. Total sales by category
SELECT 
    category,
    SUM(sales) as total_sales,
    SUM(profit) as total_profit,
    COUNT(*) as order_count
FROM superstore_sales
GROUP BY category
ORDER BY total_sales DESC;

-- 7. Average discount by segment
SELECT 
    segment,
    AVG(discount) as avg_discount,
    MIN(discount) as min_discount,
    MAX(discount) as max_discount
FROM superstore_sales
GROUP BY segment;

-- 8. Monthly sales trend
SELECT 
    DATE_TRUNC('month', order_date) as month,
    SUM(sales) as monthly_sales,
    COUNT(DISTINCT order_id) as order_count
FROM superstore_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

-- 9. HAVING clause
SELECT 
    customer_name,
    SUM(sales) as total_spent
FROM superstore_sales
GROUP BY customer_name
HAVING SUM(sales) > 5000
ORDER BY total_spent DESC;

Query Set 3: JOIN Preparation

-- Create dimension tables for practice

-- Customer dimension
CREATE TABLE customers AS
SELECT DISTINCT
    customer_id,
    customer_name,
    segment,
    country,
    city,
    state,
    region
FROM superstore_sales;

-- Product dimension
CREATE TABLE products AS
SELECT DISTINCT
    product_id,
    product_name,
    category,
    sub_category
FROM superstore_sales;

-- Fact table (orders)
CREATE TABLE orders AS
SELECT 
    row_id,
    order_id,
    order_date,
    ship_date,
    ship_mode,
    customer_id,
    product_id,
    sales,
    quantity,
    discount,
    profit
FROM superstore_sales;

-- 10. INNER JOIN
SELECT 
    o.order_id,
    c.customer_name,
    p.product_name,
    o.sales
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id
LIMIT 100;

-- 11. LEFT JOIN with NULL check
SELECT 
    c.customer_name,
    COUNT(o.order_id) as order_count,
    COALESCE(SUM(o.sales), 0) as total_sales
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY order_count DESC;

Query Set 4: Subqueries (Foundation for CTEs)

-- 12. Subquery in WHERE
SELECT * FROM superstore_sales
WHERE sales > (SELECT AVG(sales) FROM superstore_sales);

-- 13. Subquery in FROM
SELECT 
    category,
    avg_sales,
    CASE 
        WHEN avg_sales > overall_avg THEN 'Above Average'
        ELSE 'Below Average'
    END as performance
FROM (
    SELECT 
        category,
        AVG(sales) as avg_sales
    FROM superstore_sales
    GROUP BY category
) cat_avg
CROSS JOIN (
    SELECT AVG(sales) as overall_avg 
    FROM superstore_sales
) overall;

-- 14. Correlated subquery
SELECT 
    s1.category,
    s1.product_name,
    s1.sales
FROM superstore_sales s1
WHERE s1.sales = (
    SELECT MAX(s2.sales)
    FROM superstore_sales s2
    WHERE s2.category = s1.category
);

Part 5: Preview of Advanced Concepts

Window Functions Teaser

-- Running total of sales over time
SELECT 
    order_date,
    sales,
    SUM(sales) OVER (ORDER BY order_date) as running_total
FROM superstore_sales
ORDER BY order_date
LIMIT 50;

-- Rank products by sales within each category
SELECT 
    category,
    product_name,
    SUM(sales) as total_sales,
    RANK() OVER (PARTITION BY category ORDER BY SUM(sales) DESC) as rank_in_category
FROM superstore_sales
GROUP BY category, product_name
ORDER BY category, rank_in_category
LIMIT 30;

CTEs (Common Table Expressions) Teaser

-- Using WITH clause
WITH monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        SUM(sales) as total_sales
    FROM superstore_sales
    GROUP BY DATE_TRUNC('month', order_date)
)
SELECT 
    month,
    total_sales,
    LAG(total_sales) OVER (ORDER BY month) as previous_month,
    total_sales - LAG(total_sales) OVER (ORDER BY month) as month_over_month_change
FROM monthly_sales
ORDER BY month;

Part 6: Practice Exercises

Exercise 1: Data Exploration

  1. How many unique customers are there?
  2. What date range does the dataset cover?
  3. Which region has the highest total sales?
  4. What's the average order value?
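One pitfall in question 4: `AVG(sales)` averages line items, not orders, because each order spans several rows. A sketch of the distinction on toy data (SQLite via Python; values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore_sales (order_id TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore_sales VALUES (?, ?)",
    [("A1", 100.0), ("A1", 50.0), ("B2", 30.0)],  # order A1 has two line items
)

# Average order value: total sales divided by distinct orders,
# not AVG(sales), which would average the three line items instead.
(aov,) = conn.execute(
    "SELECT SUM(sales) / COUNT(DISTINCT order_id) FROM superstore_sales"
).fetchone()
print(aov)  # 180.0 over 2 orders
```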

Exercise 2: Business Questions

  1. Find the top 10 customers by total profit
  2. Which product category has the highest profit margin?
  3. What's the average delivery time (ship_date - order_date)?
  4. Which month had the highest sales in 2017?

Exercise 3: Complex Queries

  1. Find customers who made purchases in all four regions
  2. Calculate the percentage of sales each category contributes
  3. Find products that were never discounted
  4. Identify customers whose average order value is above the overall average
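As a hint for question 2 of this set, the percentage pattern is a grouped sum divided by a grand total. One sketch uses a scalar subquery for the total (a window `SUM(sales) OVER ()` works too); toy data again:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore_sales (category TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore_sales VALUES (?, ?)",
    [("Technology", 300.0), ("Furniture", 100.0), ("Technology", 100.0)],
)

# Each category's share of total sales, via a scalar subquery
# for the grand total.
rows = conn.execute(
    """
    SELECT category,
           100.0 * SUM(sales) / (SELECT SUM(sales) FROM superstore_sales) AS pct
    FROM superstore_sales
    GROUP BY category
    ORDER BY pct DESC
    """
).fetchall()
print(rows)  # Technology 400 of 500, Furniture 100 of 500
```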

Exercise 4: Data Quality Checks

-- Check for NULL values
SELECT 
    COUNT(*) as total_rows,
    COUNT(order_id) as non_null_orders,
    COUNT(customer_id) as non_null_customers,
    COUNT(sales) as non_null_sales
FROM superstore_sales;

-- Check for duplicates
SELECT 
    order_id, 
    product_id, 
    COUNT(*) 
FROM superstore_sales
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;

-- Check data ranges
SELECT 
    MIN(order_date) as earliest_order,
    MAX(order_date) as latest_order,
    MIN(sales) as min_sale,
    MAX(sales) as max_sale,
    AVG(sales) as avg_sale
FROM superstore_sales;

Part 7: Performance Tips

Creating Indexes

-- Create indexes on frequently queried columns
CREATE INDEX idx_order_date ON superstore_sales(order_date);
CREATE INDEX idx_customer_id ON superstore_sales(customer_id);
CREATE INDEX idx_category ON superstore_sales(category);
CREATE INDEX idx_region ON superstore_sales(region);

-- Composite index for common query patterns
CREATE INDEX idx_customer_date ON superstore_sales(customer_id, order_date);
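You can watch an index change a query plan without leaving Python. This is only a rough analogue (SQLite's `EXPLAIN QUERY PLAN` rather than PostgreSQL's `EXPLAIN`, and the exact plan wording varies by SQLite version), but the before/after effect is the same idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id TEXT, order_date TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("C1", "2017-01-01", 10.0), ("C2", "2017-01-02", 20.0)],
)

# Without an index the planner scans the whole table...
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE customer_id = 'C1'"
).fetchall()

# ...with one it can seek directly to the matching rows.
conn.execute("CREATE INDEX idx_customer ON t(customer_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE customer_id = 'C1'"
).fetchall()

print(before[0][-1])  # e.g. "SCAN t"
print(after[0][-1])   # e.g. "SEARCH t USING INDEX idx_customer (customer_id=?)"
```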

EXPLAIN and Query Analysis

-- See query execution plan
EXPLAIN SELECT * FROM superstore_sales WHERE order_date > '2017-01-01';

-- See actual execution stats
EXPLAIN ANALYZE 
SELECT category, SUM(sales) 
FROM superstore_sales 
GROUP BY category;

Part 8: Next Steps Checklist

  • [ ] Install PostgreSQL
  • [ ] Download 1-2 datasets from recommended sources
  • [ ] Create database and load data
  • [ ] Complete all basic query exercises
  • [ ] Create indexes on key columns
  • [ ] Try the window function and CTE teasers
  • [ ] Document any questions for the advanced session

Resources

Tools

  • pgAdmin: PostgreSQL GUI client
  • DBeaver: Universal database tool
  • DataGrip: JetBrains SQL IDE

Quick Reference Card

-- CONNECTION
psql -U username -d database_name

-- COMMON AGGREGATES
COUNT(), SUM(), AVG(), MIN(), MAX()

-- DATE FUNCTIONS
DATE_TRUNC('month', date_column)
EXTRACT(YEAR FROM date_column)
AGE(date1, date2)

-- STRING FUNCTIONS
CONCAT(), UPPER(), LOWER(), SUBSTRING()
LIKE, ILIKE (case-insensitive)

-- CONDITIONAL
CASE WHEN ... THEN ... ELSE ... END
COALESCE(value1, value2, default)
NULLIF(value1, value2)

-- GROUPING
GROUP BY, HAVING
GROUP BY ROLLUP, CUBE, GROUPING SETS
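`COALESCE` and `NULLIF` from the card trip people up, so here is a one-liner you can verify anywhere (SQLite via Python; the expressions are standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# COALESCE returns the first non-NULL argument;
# NULLIF returns NULL when its two arguments are equal, else the first.
row = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback'), NULLIF(5, 5), NULLIF(5, 3)"
).fetchone()
print(row)
```

`NULLIF(x, 0)` is a handy guard against division-by-zero errors in ratio queries.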

Good luck with your Day 2 preparation! 🚀
