Neeraj Kumar
Day 2: Advanced SQL Preparation Guide

Objectives

  • Select and download suitable datasets
  • Set up PostgreSQL environment
  • Practice basic queries as a foundation
  • Prepare for CTEs, window functions, and query optimization

Part 1: Dataset Selection

Recommended Datasets

Dataset 1: E-commerce Sales Data (Primary)

Source: Kaggle - "Superstore Sales Dataset"

Key columns: Order ID, Customer ID, Product, Category, Sales, Profit, Quantity, Order Date, Ship Date

Advanced SQL use cases:

  • Running totals and moving averages (window functions)
  • Customer cohort analysis (CTEs)
  • Sales ranking by region/category (RANK, DENSE_RANK)
  • Year-over-year comparisons
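You can preview the running-total idea before PostgreSQL is even installed: SQLite (3.25+) supports the same `OVER` clause, so a rough sketch with Python's bundled `sqlite3` and made-up toy data looks like this (table name and values are illustrative only):

```python
import sqlite3

# Toy sales table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2017-01-01", 100.0), ("2017-01-02", 50.0), ("2017-01-03", 25.0)],
)

# Running total: the same shape as SUM(sales) OVER (ORDER BY order_date)
rows = conn.execute(
    """
    SELECT order_date,
           amount,
           SUM(amount) OVER (ORDER BY order_date) AS running_total
    FROM sales
    ORDER BY order_date
    """
).fetchall()
print(rows)
```

The same query text runs unchanged in PostgreSQL against the real dataset.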

Dataset 2: Employee/HR Data (Secondary)

Source: Kaggle - "HR Analytics Dataset"

Advanced SQL use cases:

  • Employee tenure analysis
  • Salary percentiles by department
  • Retention rate calculations
  • Manager-employee relationships
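Manager-employee relationships are usually modeled as a self-referencing `manager_id` column and queried with a self-join. A minimal sketch, assuming a hypothetical `employees` table of that shape (names here are invented):

```python
import sqlite3

# Hypothetical employees table with a self-referencing manager_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", None), (2, "Ben", 1), (3, "Cara", 1)],
)

# Self-join: pair each employee with their manager's name.
# LEFT JOIN keeps the top-level employee whose manager_id is NULL.
pairs = conn.execute(
    """
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.id
    ORDER BY e.id
    """
).fetchall()
print(pairs)
```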

Alternative Dataset Options:

Option A: COVID-19 Data

Option B: Airbnb Listings


Part 2: PostgreSQL Setup

Installation

On macOS:

# Using Homebrew
brew install postgresql@15

# Start PostgreSQL service
brew services start postgresql@15

# Verify installation
psql --version

On Ubuntu/Debian:

# Update package list
sudo apt update

# Install PostgreSQL
sudo apt install postgresql postgresql-contrib

# Start service
sudo systemctl start postgresql
sudo systemctl enable postgresql

# Switch to postgres user
sudo -u postgres psql

On Windows:

  1. Download installer from: https://www.postgresql.org/download/windows/
  2. Run the installer (includes pgAdmin 4)
  3. Remember your superuser password
  4. Add PostgreSQL bin to PATH

Initial Configuration

-- Connect to PostgreSQL
psql -U postgres

-- Create a database for practice
CREATE DATABASE data_engineering_practice;

-- Connect to the new database
\c data_engineering_practice

-- Create a user (optional)
CREATE USER data_engineer WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE data_engineering_practice TO data_engineer;

Useful PostgreSQL Commands

-- List all databases
\l

-- List all tables in current database
\dt

-- Describe table structure
\d table_name

-- List all schemas
\dn

-- Quit psql
\q

-- Execute SQL from file
\i /path/to/file.sql

-- Show query execution time
\timing on

Part 3: Loading Your Data

Method 1: Using COPY (Fastest for CSV)

-- Create table structure first
CREATE TABLE superstore_sales (
    row_id SERIAL PRIMARY KEY,
    order_id VARCHAR(50),
    order_date DATE,
    ship_date DATE,
    ship_mode VARCHAR(50),
    customer_id VARCHAR(50),
    customer_name VARCHAR(100),
    segment VARCHAR(50),
    country VARCHAR(50),
    city VARCHAR(100),
    state VARCHAR(50),
    postal_code VARCHAR(20),
    region VARCHAR(50),
    product_id VARCHAR(50),
    category VARCHAR(50),
    sub_category VARCHAR(50),
    product_name VARCHAR(255),
    sales DECIMAL(10,2),
    quantity INTEGER,
    discount DECIMAL(4,2),
    profit DECIMAL(10,2)
);

-- Load CSV data
COPY superstore_sales(order_id, order_date, ship_date, ship_mode, 
                      customer_id, customer_name, segment, country, 
                      city, state, postal_code, region, product_id, 
                      category, sub_category, product_name, sales, 
                      quantity, discount, profit)
FROM '/path/to/superstore.csv'
DELIMITER ','
CSV HEADER;

Method 2: Using pgAdmin GUI

  1. Right-click on table → Import/Export
  2. Select file and format
  3. Map columns
  4. Import

Method 3: Using Python (for data cleaning)

import pandas as pd
from sqlalchemy import create_engine

# Read and clean data
df = pd.read_csv('superstore.csv')

# Connect to PostgreSQL
engine = create_engine('postgresql://username:password@localhost:5432/data_engineering_practice')

# Load to database
# NOTE: if_exists='replace' drops and recreates the table, discarding
# the column types defined earlier; use if_exists='append' to keep them
df.to_sql('superstore_sales', engine, if_exists='replace', index=False)

Part 4: Basic SQL Practice Queries

Query Set 1: Basic SELECT and Filtering

-- 1. View first 10 rows
SELECT * FROM superstore_sales LIMIT 10;

-- 2. Count total orders
SELECT COUNT(DISTINCT order_id) as total_orders 
FROM superstore_sales;

-- 3. Filter by category
SELECT * FROM superstore_sales 
WHERE category = 'Technology' 
LIMIT 20;

-- 4. Multiple conditions
SELECT customer_name, sales, profit 
FROM superstore_sales 
WHERE region = 'West' 
  AND sales > 1000 
  AND order_date >= '2017-01-01';

-- 5. Pattern matching
SELECT DISTINCT product_name 
FROM superstore_sales 
WHERE product_name LIKE '%Chair%';

Query Set 2: Aggregations

-- 6. Total sales by category
SELECT 
    category,
    SUM(sales) as total_sales,
    SUM(profit) as total_profit,
    COUNT(*) as order_count
FROM superstore_sales
GROUP BY category
ORDER BY total_sales DESC;

-- 7. Average discount by segment
SELECT 
    segment,
    AVG(discount) as avg_discount,
    MIN(discount) as min_discount,
    MAX(discount) as max_discount
FROM superstore_sales
GROUP BY segment;

-- 8. Monthly sales trend
SELECT 
    DATE_TRUNC('month', order_date) as month,
    SUM(sales) as monthly_sales,
    COUNT(DISTINCT order_id) as order_count
FROM superstore_sales
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

-- 9. HAVING clause
SELECT 
    customer_name,
    SUM(sales) as total_spent
FROM superstore_sales
GROUP BY customer_name
HAVING SUM(sales) > 5000
ORDER BY total_spent DESC;

Query Set 3: JOIN Preparation

-- Create dimension tables for practice

-- Customer dimension
CREATE TABLE customers AS
SELECT DISTINCT
    customer_id,
    customer_name,
    segment,
    country,
    city,
    state,
    region
FROM superstore_sales;

-- Product dimension
CREATE TABLE products AS
SELECT DISTINCT
    product_id,
    product_name,
    category,
    sub_category
FROM superstore_sales;

-- Fact table (orders)
CREATE TABLE orders AS
SELECT 
    row_id,
    order_id,
    order_date,
    ship_date,
    ship_mode,
    customer_id,
    product_id,
    sales,
    quantity,
    discount,
    profit
FROM superstore_sales;

-- 10. INNER JOIN
SELECT 
    o.order_id,
    c.customer_name,
    p.product_name,
    o.sales
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN products p ON o.product_id = p.product_id
LIMIT 100;

-- 11. LEFT JOIN with NULL check
SELECT 
    c.customer_name,
    COUNT(o.order_id) as order_count,
    COALESCE(SUM(o.sales), 0) as total_sales
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY order_count DESC;

Query Set 4: Subqueries (Foundation for CTEs)

-- 12. Subquery in WHERE
SELECT * FROM superstore_sales
WHERE sales > (SELECT AVG(sales) FROM superstore_sales);

-- 13. Subquery in FROM
SELECT 
    category,
    avg_sales,
    CASE 
        WHEN avg_sales > overall_avg THEN 'Above Average'
        ELSE 'Below Average'
    END as performance
FROM (
    SELECT 
        category,
        AVG(sales) as avg_sales
    FROM superstore_sales
    GROUP BY category
) cat_avg
CROSS JOIN (
    SELECT AVG(sales) as overall_avg 
    FROM superstore_sales
) overall;

-- 14. Correlated subquery
SELECT 
    s1.category,
    s1.product_name,
    s1.sales
FROM superstore_sales s1
WHERE s1.sales = (
    SELECT MAX(s2.sales)
    FROM superstore_sales s2
    WHERE s2.category = s1.category
);

Part 5: Preview of Advanced Concepts

Window Functions Teaser

-- Running total of sales over time
SELECT 
    order_date,
    sales,
    SUM(sales) OVER (ORDER BY order_date) as running_total
FROM superstore_sales
ORDER BY order_date
LIMIT 50;

-- Rank products by sales within each category
SELECT 
    category,
    product_name,
    SUM(sales) as total_sales,
    RANK() OVER (PARTITION BY category ORDER BY SUM(sales) DESC) as rank_in_category
FROM superstore_sales
GROUP BY category, product_name
ORDER BY category, rank_in_category
LIMIT 30;

CTEs (Common Table Expressions) Teaser

-- Using WITH clause
WITH monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) as month,
        SUM(sales) as total_sales
    FROM superstore_sales
    GROUP BY DATE_TRUNC('month', order_date)
)
SELECT 
    month,
    total_sales,
    LAG(total_sales) OVER (ORDER BY month) as previous_month,
    total_sales - LAG(total_sales) OVER (ORDER BY month) as month_over_month_change
FROM monthly_sales
ORDER BY month;

Part 6: Practice Exercises

Exercise 1: Data Exploration

  1. How many unique customers are there?
  2. What date range does the dataset cover?
  3. Which region has the highest total sales?
  4. What's the average order value?
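One pitfall in question 4: `AVG(sales)` averages line items, not orders, because each order spans several rows. A sketch of the distinction on toy data (SQLite via Python; values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore_sales (order_id TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore_sales VALUES (?, ?)",
    [("A1", 100.0), ("A1", 50.0), ("B2", 30.0)],  # order A1 has two line items
)

# Average order value: total sales divided by distinct orders,
# not AVG(sales), which would average the three line items instead.
(aov,) = conn.execute(
    "SELECT SUM(sales) / COUNT(DISTINCT order_id) FROM superstore_sales"
).fetchone()
print(aov)  # 180.0 over 2 orders
```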

Exercise 2: Business Questions

  1. Find the top 10 customers by total profit
  2. Which product category has the highest profit margin?
  3. What's the average delivery time (ship_date - order_date)?
  4. Which month had the highest sales in 2017?

Exercise 3: Complex Queries

  1. Find customers who made purchases in all four regions
  2. Calculate the percentage of sales each category contributes
  3. Find products that were never discounted
  4. Identify customers whose average order value is above the overall average
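As a hint for question 2 of this set, the percentage pattern is a grouped sum divided by a grand total. One sketch uses a scalar subquery for the total (a window `SUM(sales) OVER ()` works too); toy data again:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE superstore_sales (category TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO superstore_sales VALUES (?, ?)",
    [("Technology", 300.0), ("Furniture", 100.0), ("Technology", 100.0)],
)

# Each category's share of total sales, via a scalar subquery
# for the grand total.
rows = conn.execute(
    """
    SELECT category,
           100.0 * SUM(sales) / (SELECT SUM(sales) FROM superstore_sales) AS pct
    FROM superstore_sales
    GROUP BY category
    ORDER BY pct DESC
    """
).fetchall()
print(rows)  # Technology 400 of 500, Furniture 100 of 500
```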

Exercise 4: Data Quality Checks

-- Check for NULL values
SELECT 
    COUNT(*) as total_rows,
    COUNT(order_id) as non_null_orders,
    COUNT(customer_id) as non_null_customers,
    COUNT(sales) as non_null_sales
FROM superstore_sales;

-- Check for duplicates
SELECT 
    order_id, 
    product_id, 
    COUNT(*) 
FROM superstore_sales
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;

-- Check data ranges
SELECT 
    MIN(order_date) as earliest_order,
    MAX(order_date) as latest_order,
    MIN(sales) as min_sale,
    MAX(sales) as max_sale,
    AVG(sales) as avg_sale
FROM superstore_sales;

Part 7: Performance Tips

Creating Indexes

-- Create indexes on frequently queried columns
CREATE INDEX idx_order_date ON superstore_sales(order_date);
CREATE INDEX idx_customer_id ON superstore_sales(customer_id);
CREATE INDEX idx_category ON superstore_sales(category);
CREATE INDEX idx_region ON superstore_sales(region);

-- Composite index for common query patterns
CREATE INDEX idx_customer_date ON superstore_sales(customer_id, order_date);
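You can watch an index change a query plan without leaving Python. This is only a rough analogue (SQLite's `EXPLAIN QUERY PLAN` rather than PostgreSQL's `EXPLAIN`, and the exact plan wording varies by SQLite version), but the before/after effect is the same idea:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customer_id TEXT, order_date TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("C1", "2017-01-01", 10.0), ("C2", "2017-01-02", 20.0)],
)

# Without an index the planner scans the whole table...
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE customer_id = 'C1'"
).fetchall()

# ...with one it can seek directly to the matching rows.
conn.execute("CREATE INDEX idx_customer ON t(customer_id)")
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE customer_id = 'C1'"
).fetchall()

print(before[0][-1])  # e.g. "SCAN t"
print(after[0][-1])   # e.g. "SEARCH t USING INDEX idx_customer (customer_id=?)"
```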

EXPLAIN and Query Analysis

-- See query execution plan
EXPLAIN SELECT * FROM superstore_sales WHERE order_date > '2017-01-01';

-- See actual execution stats
EXPLAIN ANALYZE 
SELECT category, SUM(sales) 
FROM superstore_sales 
GROUP BY category;

Part 8: Next Steps Checklist

  • [ ] Install PostgreSQL
  • [ ] Download 1-2 datasets from recommended sources
  • [ ] Create database and load data
  • [ ] Complete all basic query exercises
  • [ ] Create indexes on key columns
  • [ ] Try the window function and CTE teasers
  • [ ] Document any questions for the advanced session

Resources

Tools

  • pgAdmin: PostgreSQL GUI client
  • DBeaver: Universal database tool
  • DataGrip: JetBrains SQL IDE

Quick Reference Card

-- CONNECTION
psql -U username -d database_name

-- COMMON AGGREGATES
COUNT(), SUM(), AVG(), MIN(), MAX()

-- DATE FUNCTIONS
DATE_TRUNC('month', date_column)
EXTRACT(YEAR FROM date_column)
AGE(date1, date2)

-- STRING FUNCTIONS
CONCAT(), UPPER(), LOWER(), SUBSTRING()
LIKE, ILIKE (case-insensitive)

-- CONDITIONAL
CASE WHEN ... THEN ... ELSE ... END
COALESCE(value1, value2, default)
NULLIF(value1, value2)

-- GROUPING
GROUP BY, HAVING
GROUP BY ROLLUP, CUBE, GROUPING SETS
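`COALESCE` and `NULLIF` from the card trip people up, so here is a one-liner you can verify anywhere (SQLite via Python; the expressions are standard SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# COALESCE returns the first non-NULL argument;
# NULLIF returns NULL when its two arguments are equal, else the first.
row = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback'), NULLIF(5, 5), NULLIF(5, 3)"
).fetchone()
print(row)
```

`NULLIF(x, 0)` is a handy guard against division-by-zero errors in ratio queries.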

Good luck with your Day 2 preparation! 🚀
