Akhilesh

Posted on May 1

Joins: Combining Tables Without Losing Your Mind

#ai #beginners #python #sql

You have two tables.

The customers table knows who people are. Name, city, contact details.

The orders table knows what they bought. Product, amount, date.

Neither table alone answers the question: which customers in Mumbai spent more than 50,000 rupees last month?

You need both tables combined through their shared key. That combination is a join.

Joins are where SQL becomes genuinely powerful. The ability to link tables that were designed separately and query them as if they were one is what makes relational databases so useful for data analysis.

The Setup

import sqlite3
import pandas as pd

conn = sqlite3.connect("joins_practice.db")

conn.executescript("""
    DROP TABLE IF EXISTS customers;
    DROP TABLE IF EXISTS orders;
    DROP TABLE IF EXISTS products;

    CREATE TABLE customers (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT,
        city         TEXT,
        tier         TEXT
    );

    CREATE TABLE orders (
        order_id     INTEGER PRIMARY KEY,
        customer_id  INTEGER,
        product_id   INTEGER,
        amount       REAL,
        order_date   TEXT,
        status       TEXT
    );

    CREATE TABLE products (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category     TEXT,
        cost_price   REAL
    );

    INSERT INTO customers VALUES
        (1, 'Alex',   'Mumbai',    'Gold'),
        (2, 'Priya',  'Delhi',     'Silver'),
        (3, 'Sam',    'Bangalore', 'Bronze'),
        (4, 'Jordan', 'Mumbai',    'Gold'),
        (5, 'Lisa',   'Chennai',   'Silver'),
        (6, 'Ravi',   'Delhi',     'Bronze');

    INSERT INTO products VALUES
        (101, 'Laptop',     'Electronics', 50000),
        (102, 'Phone',      'Electronics', 28000),
        (103, 'Headphones', 'Electronics',  1800),
        (104, 'Shirt',      'Clothing',      400),
        (105, 'Watch',      'Accessories',  6000);

    INSERT INTO orders VALUES
        (1,  1,   101, 75000, '2024-01-05', 'completed'),
        (2,  2,   102, 45000, '2024-01-08', 'completed'),
        (3,  1,   103,  3500, '2024-01-12', 'completed'),
        (4,  3,   104,   800, '2024-01-15', 'completed'),
        (5,  4,   101, 75000, '2024-01-18', 'completed'),
        (6,  2,   104,  1500, '2024-01-20', 'completed'),
        (7,  1,   105, 12000, '2024-02-01', 'completed'),
        (8,  7,   102, 45000, '2024-02-05', 'completed'),
        (9,  3,   105, 12000, '2024-02-08', 'completed'),
        (10, 4,   104,   800, '2024-02-10', 'completed');
""")
conn.commit()

def q(sql):
    return pd.read_sql_query(sql, conn)

print("Three tables created.")
print(q("SELECT * FROM customers"))
print(q("SELECT * FROM orders"))

Notice order 8 has customer_id = 7 which does not exist in the customers table. Customer 5 (Lisa) and customer 6 (Ravi) have no orders. These edge cases will show you exactly how different join types behave.

INNER JOIN: Only the Matches

Inner join returns rows that have a matching key in both tables. Unmatched rows from either side are dropped.

result = q("""
    SELECT
        c.name,
        c.city,
        o.order_id,
        o.amount,
        o.order_date
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_date;
""")
print(result)
print(f"\nRows: {len(result)}")

Output:

     name       city  order_id   amount  order_date
0    Alex     Mumbai         1  75000.0  2024-01-05
1   Priya      Delhi         2  45000.0  2024-01-08
2    Alex     Mumbai         3   3500.0  2024-01-12
3     Sam  Bangalore         4    800.0  2024-01-15
4  Jordan     Mumbai         5  75000.0  2024-01-18
5   Priya      Delhi         6   1500.0  2024-01-20
6    Alex     Mumbai         7  12000.0  2024-02-01
7     Sam  Bangalore         9  12000.0  2024-02-08
8  Jordan     Mumbai        10    800.0  2024-02-10

Rows: 9

Order 8 (customer_id 7, no matching customer) is gone. Lisa and Ravi who have no orders are gone. Only the nine orders with valid matching customers appear.

The ON o.customer_id = c.customer_id clause is the join condition. It tells SQL which columns to use as the connecting key.

Aliases o and c shorten the table names. Without aliases, you write orders.customer_id = customers.customer_id. With aliases, o.customer_id = c.customer_id. Always use aliases. Long table names in every column reference make queries unreadable.

LEFT JOIN: Keep Everything From the Left

Left join keeps all rows from the left table. If a row from the left has no match in the right, it still appears but with NULLs for the right table's columns.

result = q("""
    SELECT
        c.customer_id,
        c.name,
        c.city,
        COUNT(o.order_id)  AS total_orders,
        SUM(o.amount)      AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name, c.city
    ORDER BY total_spent DESC;
""")
print(result)

Output:

   customer_id    name       city  total_orders  total_spent
0            1    Alex     Mumbai             3     90500.0
1            4  Jordan     Mumbai             2     75800.0
2            2   Priya      Delhi             2     46500.0
3            3     Sam  Bangalore             2     12800.0
4            5    Lisa    Chennai             0         NaN
5            6    Ravi      Delhi             0         NaN

All six customers appear. Lisa and Ravi have zero orders and NaN for total_spent. An inner join would have removed them entirely.

Left join is the right choice when you need a complete list of one entity (customers) and want to attach related data when it exists, while still showing entities with no related data.

Use this for finding customers who have never ordered, users who have not logged in, products with no sales.

Finding Rows With No Match

Left join combined with a WHERE NULL check finds the rows that had no match.

result = q("""
    SELECT
        c.customer_id,
        c.name,
        c.city,
        c.tier
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL;
""")
print("Customers with no orders:")
print(result)

Output:

Customers with no orders:
   customer_id  name     city    tier
0            5  Lisa  Chennai  Silver
1            6  Ravi    Delhi  Bronze

This pattern is used constantly. Find users who signed up but never bought. Find products that were added but never ordered. Find employees assigned to no projects. The technique is always the same: LEFT JOIN then WHERE right_table.key IS NULL.

RIGHT JOIN: Keep Everything From the Right

Right join is the mirror of left join. All rows from the right table survive regardless of whether they match the left.

result = q("""
    SELECT
        c.name,
        c.city,
        o.order_id,
        o.amount,
        o.customer_id AS order_customer_id
    FROM customers c
    RIGHT JOIN orders o ON c.customer_id = o.customer_id
    ORDER BY o.order_id;
""")
print(result)

Output:

     name       city  order_id   amount  order_customer_id
0    Alex     Mumbai         1  75000.0                  1
1   Priya      Delhi         2  45000.0                  2
...
8     NaN        NaN         8  45000.0                  7
...

Order 8 appears with NaN for name and city because customer_id 7 does not exist in customers. Right join kept all orders. Lisa and Ravi are gone because they have no orders.

Most developers prefer left joins over right joins. The same result can always be achieved by swapping the table order and using a left join. Consistency matters for readability.

FULL OUTER JOIN: Keep Everything

SQLite does not support FULL OUTER JOIN natively but you can simulate it.

result = q("""
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id

    UNION ALL

    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL;
""")
print(result)
print(f"\nRows: {len(result)}")

Output includes all customers (even with no orders) and all orders (even with no matching customer). Full outer join shows you every row from both tables and flags the ones that could not find a match.

Joining Three Tables

Real queries often span three or more tables.

result = q("""
    SELECT
        c.name                           AS customer,
        c.city,
        p.product_name                   AS product,
        p.category,
        o.amount                         AS sale_price,
        p.cost_price,
        ROUND(o.amount - p.cost_price, 0) AS profit
    FROM orders o
    INNER JOIN customers c ON o.customer_id = c.customer_id
    INNER JOIN products  p ON o.product_id  = p.product_id
    WHERE o.status = 'completed'
    ORDER BY profit DESC;
""")
print(result)

Output:

   customer       city  product         category  sale_price  cost_price   profit
0      Alex     Mumbai   Laptop      Electronics     75000.0     50000.0  25000.0
1    Jordan     Mumbai   Laptop      Electronics     75000.0     50000.0  25000.0
2     Priya      Delhi    Phone      Electronics     45000.0     28000.0  17000.0
...

Three tables. Two joins. One clean result showing customer, product, and profit per transaction. This is the kind of query that lives in every business intelligence tool.

Each new table gets its own JOIN clause. The ON condition connects it to one of the tables already in the query.

Self Join: A Table Joining Itself

Sometimes a table references itself. An employees table where each employee has a manager_id that points to another row in the same table.

conn.executescript("""
    DROP TABLE IF EXISTS employees_hier;
    CREATE TABLE employees_hier (
        emp_id     INTEGER PRIMARY KEY,
        name       TEXT,
        manager_id INTEGER
    );
    INSERT INTO employees_hier VALUES
        (1, 'Director', NULL),
        (2, 'Manager A', 1),
        (3, 'Manager B', 1),
        (4, 'Alex',      2),
        (5, 'Priya',     2),
        (6, 'Sam',       3);
""")
conn.commit()

result = q("""
    SELECT
        e.name        AS employee,
        m.name        AS manager
    FROM employees_hier e
    LEFT JOIN employees_hier m ON e.manager_id = m.emp_id
    ORDER BY e.emp_id;
""")
print(result)

Output:

    employee   manager
0   Director      None
1  Manager A  Director
2  Manager B  Director
3       Alex Manager A
4      Priya Manager A
5        Sam Manager B

The same table joined to itself with two different aliases. e for the employee, m for the manager. Each employee now shows who they report to. The Director has NULL because they report to nobody.

Join Performance: The Rule You Need to Know

Joins are fast when you join on indexed columns. The primary key is always indexed. Foreign keys often are not by default.

If you join on a column that is not indexed and the tables are large, the query can become very slow. Always check what columns your joins use and whether those columns have indexes.

CREATE INDEX idx_orders_customer ON orders(customer_id);
CREATE INDEX idx_orders_product  ON orders(product_id);

Two lines. These make the joins in this post dramatically faster on large tables. In real production databases, this is part of database design. But when you are writing queries against existing databases, knowing to ask "is this column indexed?" can save you from writing queries that take minutes instead of milliseconds.

A Resource Worth Knowing

Joins Visualizer at joins.spathon.com lets you click on different join types and see animated, interactive visualizations of exactly which rows survive each join. Visual, interactive, takes two minutes. Bookmark it. Every time you are uncertain which join type to use, check it.

Jeff Atwood wrote a short classic piece called "A Visual Explanation of SQL Joins" on his blog Coding Horror (codinghorror.com) using Venn diagrams that became the standard way people teach this topic. It is ten years old and still the clearest explanation of join types available. Search "Jeff Atwood visual explanation SQL joins Coding Horror."

Try This

Create sql_joins_practice.py. Use the database from this post.

Write a query that shows each product, how many times it was ordered, total revenue, and the profit margin (sale price minus cost price). Include products that were never ordered, showing zero for orders and revenue.

Find customers who have ordered in both January and February 2024. Use a self-join or subquery approach.

Show the city-level summary: city, number of customers, number of completed orders, total revenue, and average order value per customer. Only include cities with at least two customers.

Using a left join with a NULL check, find any orders where the customer_id does not exist in the customers table. These are data integrity problems.

Write a three-table join that shows customer name, product name, category, amount paid, and whether the customer is Gold, Silver, or Bronze tier. Sort by amount descending.

What's Next

One table queries. Multi-table joins. Both covered. Next is subqueries and CTEs, which let you use the result of one query as the input to another. They make complex multi-step logic readable and maintainable instead of nested and impossible to debug.

DEV Community

Joins: Combining Tables Without Losing Your Mind

The Setup

INNER JOIN: Only the Matches

LEFT JOIN: Keep Everything From the Left

Finding Rows With No Match

RIGHT JOIN: Keep Everything From the Right

FULL OUTER JOIN: Keep Everything

Joining Three Tables

Self Join: A Table Joining Itself

Join Performance: The Rule You Need to Know

A Resource Worth Knowing

Try This

What's Next

Top comments (0)