DEV Community

Cover image for Study Notes 1.2.6: SQL Refresher
Pizofreude
Pizofreude

Posted on

Study Notes 1.2.6: SQL Refresher

1. Introduction

  • This session covers SQL basics and is part of a series on Docker and SQL.
  • A taxi zone lookup file has been downloaded and loaded into a Postgres database as a table named zones.
  • The table contains various zones, including locations in Manhattan, Brooklyn, Queens, and Newark Airport.
  • The objective is to refresh SQL basics that will be useful later in the course, particularly for analytics engineering and dbt.

2. SQL Queries and Joins

Basic Query Execution

  • Standard way of counting rows:
    This is similar to wc -l <dataset.csv> which returns total number of lines in the dataset.csv.

    SELECT COUNT(*)
    FROM table_name
    
  • The yellow taxi trips dataset is queried using:

    SELECT
        *
    FROM
        yellow_taxi_trips
    LIMIT 100;
    
    
  • The dataset contains columns like:

    • pickup_time
    • dropoff_time
    • pu_location_id (pickup location ID)
    • do_location_id (drop-off location ID)
  • Objective: Replace location IDs with actual zone names from the zones table.

INNER JOIN (Matching Records Only)

  • The yellow_taxi_trips table is joined with the zones table to fetch zone names:

    SELECT t.pickup_time, t.dropoff_time, t.total_amount,
           zp.borough AS pickup_borough, zp.zone AS pickup_zone,
           zd.borough AS dropoff_borough, zd.zone AS dropoff_zone
    FROM yellow_taxi_trips t
    JOIN zones zp ON t.pu_location_id = zp.location_id
    JOIN zones zd ON t.do_location_id = zd.location_id;
    
    
  • Explanation:

    • JOIN ensures only records where IDs match in both tables are included.
    • zones table is aliased twice (zp for pickup zones, zd for drop-off zones).
    • The result replaces location IDs with actual zone names.
  • In our case, here’s the implicit inner join:

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc",
        CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc"   
    FROM
        yellow_taxi_data t,
        taxi_zones zpu,
        taxi_zones zdo
    WHERE
        t."PULocationID" = zpu."LocationID" AND
        t."DOLocationID" = zdo."LocationID"
    LIMIT 100;
    

Alternative Approach Using WHERE Clause (Older Method)

  • An alternative way to achieve the same result using a WHERE clause:

    SELECT t.*, zp.zone AS pickup_zone, zd.zone AS dropoff_zone
    FROM yellow_taxi_trips t, zones zp, zones zd
    WHERE t.pu_location_id = zp.location_id
    AND t.do_location_id = zd.location_id;
    
    
  • Drawback: This approach is less readable and less commonly used than explicit JOIN syntax.

  • Here’s the explicit inner join in our case: Notice no WHERE clause is needed.

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        CONCAT(zpu."Borough", ' / ', zpu."Zone") AS "pickup_loc",
        CONCAT(zdo."Borough", ' / ', zdo."Zone") AS "dropoff_loc"   
    FROM
        yellow_taxi_data t JOIN taxi_zones zpu
            ON t."PULocationID" = zpu."LocationID"
        JOIN taxi_zones zdo
            ON t."DOLocationID" = zdo."LocationID"
    
    LIMIT 100;
    

Checking for Missing Pickup/Dropoff Locations

  • To check if any taxi trips have NULL location IDs:

    SELECT * FROM yellow_taxi_trips WHERE pu_location_id IS NULL;
    SELECT * FROM yellow_taxi_trips WHERE do_location_id IS NULL;
    
    

    In our case:

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        "PULocationID",
        "DOLocationID"
    FROM
        yellow_taxi_data t
    WHERE
        "PULocationID" is NULL AND
        "DOLocationID" is NULL
    LIMIT 100;
    
  • To verify if all location IDs exist in the zones table:

    SELECT pu_location_id FROM yellow_taxi_trips
    WHERE pu_location_id NOT IN (SELECT location_id FROM zones);
    
    
    • If no records are returned, all locations are valid.

    In our case:

    SELECT
        tpep_pickup_datetime,
        tpep_dropoff_datetime,
        total_amount,
        "PULocationID",
        "DOLocationID"
    FROM
        yellow_taxi_data t
    WHERE
        "PULocationID" NOT IN (SELECT "LocationID" FROM taxi_zones) AND
        "DOLocationID" NOT IN (SELECT "LocationID" FROM taxi_zones)
    LIMIT 100;
    

3. Outer Joins (Handling Missing Data)

LEFT JOIN (Keeping Unmatched Records from the Left Table)

  • If a location ID exists in yellow_taxi_trips but not in zones, a normal join would exclude that row.
  • A LEFT JOIN ensures these rows are included, with NULL values for missing zone data:

    SELECT t.pickup_time, t.dropoff_time, t.total_amount,
           COALESCE(zp.zone, 'Unknown') AS pickup_zone,
           COALESCE(zd.zone, 'Unknown') AS dropoff_zone
    FROM yellow_taxi_trips t
    LEFT JOIN zones zp ON t.pu_location_id = zp.location_id
    LEFT JOIN zones zd ON t.do_location_id = zd.location_id;
    
    
  • COALESCE replaces NULL values with 'Unknown'.

RIGHT JOIN and FULL OUTER JOIN

  • RIGHT JOIN: Keeps all records from the zones table, even if there are no matching trips.
  • FULL OUTER JOIN: Keeps all records from both tables, filling missing values with NULLs.

4. Aggregation Using GROUP BY

Counting Trips Per Day

  • To group taxi trips by day and count them:

    SELECT DATE_TRUNC('day', dropoff_time) AS trip_day, COUNT(*) AS trip_count
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY trip_day ASC;
    
    
  • Alternative approach using CAST:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY trip_day ASC;
    
    

Finding the Busiest Day

  • Sorting by count to find the day with the most trips:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, COUNT(*) AS trip_count
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY trip_count DESC
    LIMIT 1;
    
    
  • Example result: January 28th was the busiest day.

Finding Maximum Revenue Per Day

  • To check highest fare per trip per day:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, MAX(total_amount) AS max_fare
    FROM yellow_taxi_trips
    GROUP BY trip_day
    ORDER BY max_fare DESC;
    
    
  • Some drivers made over $1000 in a single trip.

Grouping by Multiple Columns

  • To group by both day and district:

    SELECT CAST(dropoff_time AS DATE) AS trip_day, zd.borough, COUNT(*) AS trip_count
    FROM yellow_taxi_trips t
    JOIN zones zd ON t.do_location_id = zd.location_id
    GROUP BY trip_day, zd.borough
    ORDER BY trip_day ASC, zd.borough ASC;
    
    
  • Instead of listing each column, use positional reference:

    GROUP BY 1, 2;
    
    

5. Summary and Key Takeaways

  • SQL Joins are crucial for combining datasets and replacing foreign keys with meaningful data.
  • INNER JOIN returns only matching records, while LEFT JOIN preserves unmatched records.
  • GROUP BY is essential for aggregations and analytics, allowing operations like counting trips per day and finding peak revenue days.
  • These SQL techniques will be essential in later topics, such as analytics engineering (dbt).
  • The session also touched on data ingestion from CSV into Postgres, which is relevant for future lessons.

Postgres on Neon - Get the Free Plan

No credit card required. The database you love, on a serverless platform designed to help you build faster.

Get Postgres on Neon

Top comments (0)

Heroku

This site is powered by Heroku

Heroku was created by developers, for developers. Get started today and find out why Heroku has been the platform of choice for brands like DEV for over a decade.

Sign Up