Mohamed Idris

Posted on May 20

Learning SQL As If You Built It Yourself

#database #sql #learning

If you have ever tried to keep your app's data in a JSON file or in memory, you know how that story ends. At first it works. Two users? Easy. A hundred users with orders, addresses, and a history of returns? Now your "database" is a 40MB file you are scared to open, and looking up a single record takes a full scan of the disk.

You start writing helper functions. Find a user by email. Find all their orders. Make sure no order points at a missing user. After a week of this, you have invented a worse, slower database, and you still cannot answer "what were our top 5 selling books last month?".

That is the gap SQL fills.

What is SQL, really

Think of SQL as a giant spreadsheet on steroids, with a strict librarian living inside. Your data goes into tables (sheets). Each row is a record. Each column has a type and rules. The librarian enforces the rules: no missing required fields, no duplicate ids, no orders pointing to a user that does not exist.

You do not poke around the sheets yourself. You write a polite question in a special language, and the librarian goes off, walks the shelves, and brings back exactly the rows you asked for. The question is declarative. You say what you want, not how to get it. The librarian (the query planner) figures out the fastest path.

That is the whole vibe.

Let's pretend we are building one

We want a way to store structured data, enforce rules between pieces of data, and answer questions about it without writing custom code every time. We will call the language SQL (Structured Query Language) and the engine behind it a relational database.

For our running example, we are opening a tiny online bookstore. Books, authors, customers, orders. We will model the whole thing and learn to ask it useful questions.

Decision 1: Data lives in tables, with a strict shape

A table has a fixed set of columns. Each column has a type. Each row must obey the shape. No "this row has an extra field" surprises.

CREATE TABLE authors (
  id          BIGSERIAL    PRIMARY KEY,
  name        TEXT         NOT NULL,
  born_year   INT
);

CREATE TABLE books (
  id          BIGSERIAL    PRIMARY KEY,
  title       TEXT         NOT NULL,
  author_id   BIGINT       NOT NULL REFERENCES authors(id),
  price_cents INT          NOT NULL CHECK (price_cents >= 0),
  published   DATE,
  created_at  TIMESTAMPTZ  NOT NULL DEFAULT now()
);

CREATE TABLE customers (
  id     BIGSERIAL PRIMARY KEY,
  email  TEXT      NOT NULL UNIQUE,
  name   TEXT      NOT NULL
);

CREATE TABLE orders (
  id          BIGSERIAL    PRIMARY KEY,
  customer_id BIGINT       NOT NULL REFERENCES customers(id),
  book_id     BIGINT       NOT NULL REFERENCES books(id),
  qty         INT          NOT NULL CHECK (qty > 0),
  placed_at   TIMESTAMPTZ  NOT NULL DEFAULT now()
);

A few things worth pausing on, because they show up everywhere:

PRIMARY KEY is the row's unique identity. Every table has one.
REFERENCES other_table(id) is a foreign key. The librarian will refuse to insert an order for a customer that does not exist, and refuse to delete a customer who still has orders (unless you say ON DELETE CASCADE).
NOT NULL, UNIQUE, CHECK are constraints that bake business rules into the data, not into the app code. The database becomes the last line of defense, even if six different apps write to it.
DEFAULT lets the database fill in values for you (timestamps, ids, version numbers).

This shape is the schema. The whole database is just schemas plus rows.

Quick sanity rule: if your data has a stable shape and relationships matter, use SQL. If your data is messy, document like, and varies row to row, that is what we will discuss in the MongoDB post.

Decision 2: Four verbs, infinite combinations

Every interaction with SQL is one of four verbs. Learn these and you have learned 70% of the language.

Insert

INSERT INTO authors (name, born_year)
VALUES ('Saint-Exupery', 1900);

INSERT INTO books (title, author_id, price_cents, published)
VALUES
  ('The Little Prince', 1, 999,  '1943-04-06'),
  ('Night Flight',      1, 1199, '1931-01-01');

Select (the workhorse)

SELECT id, title, price_cents
FROM   books
WHERE  price_cents < 1500
ORDER  BY published DESC
LIMIT  10;

Read it like English: pick these columns, from this table, where this is true, sorted this way, give me the first 10.

Update

UPDATE books
SET    price_cents = 1099
WHERE  id = 1;

If you forget the WHERE, you update every row. Yes, every senior has done it once. Wrap big updates in a transaction, we will see how soon.

Delete

DELETE FROM books WHERE id = 1;

Same warning as UPDATE. No WHERE, no rows left.

Decision 3: Filtering rows the librarian can understand

Inside WHERE you can stack conditions. A short list of the most useful operators:

SELECT * FROM books
WHERE  price_cents BETWEEN 800 AND 1500
   AND title       ILIKE '%prince%'        -- case insensitive in Postgres
   AND author_id   IN (1, 2, 3)
   AND published  IS NOT NULL
   AND published >= DATE '2000-01-01';

The classic gotcha: NULL is not equal to anything, not even itself. WHERE col = NULL always returns no rows. Use IS NULL and IS NOT NULL. Always.

LIKE 'foo%' matches the prefix foo. LIKE '%foo' matches the suffix. LIKE '%foo%' is the full text contains. The percent sign is the wildcard.

Decision 4: Joining tables, the actual superpower

Splitting data into multiple tables only pays off if we can stitch them back together at query time. That is what joins do.

The four flavors that matter:

-- INNER JOIN: only rows that match in both tables
SELECT b.title, a.name AS author
FROM   books b
JOIN   authors a ON a.id = b.author_id;

-- LEFT JOIN: every row from the left, with NULLs where the right has nothing
SELECT a.name, b.title
FROM   authors a
LEFT JOIN books b ON b.author_id = a.id;
-- An author with zero books still appears, with title = NULL.

-- RIGHT JOIN: mirror image of LEFT JOIN. Rarely used (just flip the tables).

-- FULL OUTER JOIN: every row from both sides, NULL where the other is missing.

The ON clause says how to match. The WHERE clause filters the joined result.

Two senior level rules that catch newcomers:

A LEFT JOIN plus a WHERE on the right table acts like an INNER JOIN. If you write LEFT JOIN books WHERE books.published IS NOT NULL, you have just thrown away the unmatched rows. Put filters that should keep the unmatched rows inside the ON clause: LEFT JOIN books b ON b.author_id = a.id AND b.published IS NOT NULL.
Always alias your tables in non trivial queries. b.title is much easier to read (and grep) than books.title repeated five times.

Decision 5: Aggregations, the "summarize this" verbs

When you do not want individual rows but a roll up, you reach for GROUP BY and the aggregate functions: COUNT, SUM, AVG, MIN, MAX.

-- Top selling books last month
SELECT b.id, b.title, SUM(o.qty) AS sold
FROM   orders o
JOIN   books  b ON b.id = o.book_id
WHERE  o.placed_at >= NOW() - INTERVAL '30 days'
GROUP  BY b.id, b.title
ORDER  BY sold DESC
LIMIT  5;

Two rules to internalize:

Every column in the SELECT must either be in the GROUP BY or wrapped in an aggregate.
HAVING filters groups. WHERE filters rows before grouping. Do not confuse them.

SELECT customer_id, COUNT(*) AS orders_count
FROM   orders
GROUP  BY customer_id
HAVING COUNT(*) >= 3;        -- only customers with 3+ orders

Decision 6: When one query is not enough, build it in pieces

For complex queries, stacking joins and aggregations gets unreadable fast. We added two tools for that.

Subqueries

A query inside a query. Treat its result as a virtual table.

SELECT title
FROM   books
WHERE  author_id IN (SELECT id FROM authors WHERE born_year < 1950);

Common Table Expressions (CTEs)

Same idea, named, top to bottom. Far more readable when the query has multiple stages.

WITH top_books AS (
  SELECT book_id, SUM(qty) AS sold
  FROM   orders
  WHERE  placed_at >= NOW() - INTERVAL '30 days'
  GROUP  BY book_id
)
SELECT b.title, t.sold
FROM   top_books t
JOIN   books b ON b.id = t.book_id
ORDER  BY t.sold DESC
LIMIT  5;

CTEs can be recursive, which is how you walk tree shaped data (categories with subcategories, employees with managers, comments with replies):

WITH RECURSIVE descendants AS (
  SELECT id, parent_id, name FROM categories WHERE id = 7
  UNION ALL
  SELECT c.id, c.parent_id, c.name
  FROM   categories c
  JOIN   descendants d ON d.id = c.parent_id
)
SELECT * FROM descendants;

CTEs read top to bottom like a small program. Use them.

Decision 7: Window functions, the senior superpower

Sometimes you want a per row calculation that "sees" other rows around it without collapsing into groups. That is what window functions do.

-- Rank books by sales within each author
SELECT
  b.author_id,
  b.title,
  SUM(o.qty) AS sold,
  RANK() OVER (
    PARTITION BY b.author_id
    ORDER BY SUM(o.qty) DESC
  ) AS rank_for_author
FROM books b
JOIN orders o ON o.book_id = b.id
GROUP BY b.author_id, b.title;

OVER(...) says "do this calculation across this window". PARTITION BY is the group, ORDER BY is the order within the group. You get one output row per input row, plus the calculated value.

Common ones to know:

ROW_NUMBER() for "give each row a serial number".
RANK() and DENSE_RANK() for tied ranking.
LAG(col), LEAD(col) for "value of the previous/next row in the window".
SUM(col) OVER (ORDER BY ...) for running totals.

Once these click, a whole class of "I had to do this in app code" problems become one liners.

Decision 8: Indexes, the secret to fast queries

A table without indexes is a phone book printed in random order. Finding "Alice" means starting at page one. With an index, the librarian builds an alphabetical map: "Alice is around page 47, go straight there".

CREATE INDEX books_author_id_idx ON books (author_id);
CREATE INDEX orders_customer_placed_idx ON orders (customer_id, placed_at DESC);
CREATE UNIQUE INDEX customers_email_idx ON customers (email);

The senior level rules:

Primary keys and unique constraints already create indexes. You get those free.
Index columns you filter or join on, not columns you only display.
Composite index column order matters. (customer_id, placed_at) helps queries that filter by customer_id (alone) or by customer_id plus placed_at. It does not help a query that only filters by placed_at.
Postgres has special index types: GIN (for arrays and JSONB), BRIN (for huge time series), partial (WHERE active = true), expression (LOWER(email)).
Indexes are not free. Every write has to update them. Do not index every column "just in case". Profile first.

To find out what the database is actually doing for a query, ask:

EXPLAIN ANALYZE
SELECT * FROM books WHERE author_id = 1;

That returns the query plan. Look for "Index Scan" (good) versus "Seq Scan" (full table read, fine for small tables, scary for big ones), and look at the actual time per step.

Reading query plans is the single skill that separates "writes SQL" from "writes fast SQL". Spend time on it.

Decision 9: Transactions, all or nothing

Sometimes a single user action needs multiple statements, and "half done" is the worst possible state. Money moved out of one account but never into the other. An order created without payment.

We wrap the statements in a transaction. Either every statement commits, or none of them do.

BEGIN;

UPDATE accounts SET balance = balance - 1000 WHERE id = 1;
UPDATE accounts SET balance = balance + 1000 WHERE id = 2;

COMMIT;        -- or ROLLBACK if anything went wrong

Transactions guarantee four properties together known as ACID:

Atomicity: all or nothing.
Consistency: constraints stay valid before and after.
Isolation: concurrent transactions do not see each other's half done work.
Durability: once committed, it survives a crash.

Isolation has levels (read committed, repeatable read, serializable). The default in Postgres is READ COMMITTED, which is fine for most apps. Reach for SERIALIZABLE when you have a critical multi step invariant and you would rather have the database retry than risk anomalies. Read about phantom reads and lost updates once, then you will recognize the bug shapes when you see them.

Decision 10: Views and materialized views

If a complex query is something you ask all the time, give it a name.

CREATE VIEW top_books_30d AS
SELECT b.id, b.title, SUM(o.qty) AS sold
FROM   orders o
JOIN   books  b ON b.id = o.book_id
WHERE  o.placed_at >= NOW() - INTERVAL '30 days'
GROUP  BY b.id, b.title;

SELECT * FROM top_books_30d ORDER BY sold DESC LIMIT 5;

A view is just a saved query. It runs every time you SELECT from it.

A materialized view stores the result physically and refreshes on demand. Great for expensive analytics that do not need to be real time.

CREATE MATERIALIZED VIEW top_books_30d AS ...;
REFRESH MATERIALIZED VIEW top_books_30d;

Decision 11: Set operations

Less famous, but very handy.

-- Books either by author 1 or with sales (and both)
SELECT id FROM books WHERE author_id = 1
UNION
SELECT book_id FROM orders;

-- Same idea but keep duplicates
UNION ALL

-- In A but not in B
EXCEPT

-- In both
INTERSECT

UNION deduplicates and is slower. UNION ALL keeps everything and is faster. Pick on purpose.

Decision 12: Talking to SQL safely from your app

This is the one rule you must never break.

Never build a query by string concatenation with user input. That is how you get SQL injection, the most embarrassing bug in our profession.

// NEVER
db.query(`SELECT * FROM users WHERE email = '${email}'`);

// ALWAYS
db.query("SELECT * FROM users WHERE email = $1", [email]);

Use parameterized queries. Every modern client supports them. The driver sends the SQL and the values separately, so the database never confuses one for the other.

The other classic app side trap is the N+1 query problem. You fetch a list of authors, then for each one, you fire a separate query for their books. 1 query becomes 1 + N queries. The fix is one query with a JOIN, or batch loading by id list (WHERE author_id IN (...)).

Decision 13: Designing schemas that age well

The art is in the schema, not the queries. Tables you regret are tables you fight forever. A short senior level checklist:

Normalize first, denormalize on purpose. A normalized schema (no duplicated facts) is easier to keep correct. You can always add a denormalized cache column later if profiling demands it.
Pick the right primary key. BIGSERIAL (auto increment) is fine for most things. UUIDs are great for distributed systems and for not exposing row counts in URLs. Use UUID v7 if you need sortable UUIDs.
Always store timestamps with timezone. TIMESTAMPTZ, not TIMESTAMP. Future you will thank present you.
Money in integer cents, not floats. 1.10 + 2.20 in floats is heartbreak.
Booleans are fine. Do not use a CHAR(1) "Y"/"N". This is not 1995.
Add indexes for foreign keys. The database does not always do this for you, and queries that join on them will sit and cry without one.
Soft delete with a deleted_at TIMESTAMPTZ column if you need recoverable deletes. Then add WHERE deleted_at IS NULL to your reads (or a partial index).
Migrations are version controlled SQL. Tools like Flyway, Sqitch, Prisma Migrate, Alembic. Never edit the production schema by hand.

Decision 14: A peek under the hood

What really happens when you run a query:

The client sends the SQL to the server (bind parameters separately if you used a parameterized query).
The parser turns the text into a syntax tree.
The planner / optimizer considers many possible execution plans (full scan, index scan, different join orders, hash join vs merge join vs nested loop). It uses statistics about the table sizes and column distributions to pick the cheapest plan it can find.
The executor runs the chosen plan, reading pages from disk or the buffer cache, joining, filtering, aggregating.
Results stream back to the client. Transactions wrap all of this in WAL (write ahead log) entries for durability.

That little planner is the brain of the whole thing. It is also why the same query can be fast on Tuesday and slow on Wednesday: the data shape changed, and the planner picked a different plan. Keep statistics fresh (ANALYZE in Postgres) and you will rarely be surprised.

A short guide to dialects

SQL is a standard. Real engines extend and tweak it. The big three you will see:

PostgreSQL: the senior favorite. Best in class JSON support, window functions, CTEs, generated columns, partial indexes, RETURNING, extensions. Default choice for most new apps in 2026.
MySQL / MariaDB: massive ecosystem, fine performance, weaker default constraints (set STRICT_ALL_TABLES), historically weaker on JSON and window features (now caught up).
SQLite: a single file database, perfect for embedded use, tests, mobile, and surprisingly capable for small to mid sized apps. Great for prototypes.

Almost every example in this post is portable. The places you will find dialect differences: identifier quoting, auto increment syntax, JSON functions, full text search, and date functions.

Tiny tips that will save you later

SELECT * is fine in the editor, dangerous in production code. List the columns you need.
Always run an UPDATE or DELETE as a SELECT first. Same WHERE, same eyes.
Wrap risky changes in a transaction with BEGIN; ... ROLLBACK; until you are certain.
COUNT(*) counts rows. COUNT(col) skips NULLs. They are different.
Prefer EXISTS over IN for subqueries that may grow large.
Filter early. Push WHERE and joins as low as possible so fewer rows climb the pipeline.
Read the query plan before optimizing. Guessing is a waste of an evening.
Connection pool, do not connect per request. Use PgBouncer, the language client's built in pool, or your framework's helper.
Put long names into a style guide. snake_case for tables and columns, plural for tables (books, not book). Pick one and never argue about it again.
Back up the database. Test the restore. A backup you have never restored is a wish, not a backup.

Wrapping up

So that is the whole story. We were tired of inventing tiny worse databases out of files and helper functions. We built one engine that holds structured data, enforces relationships, and answers questions in a declarative language we agreed to call SQL.

We taught the engine four verbs: INSERT, SELECT, UPDATE, DELETE. We let it stitch tables back together with joins, summarize with aggregations, and walk windows with window functions. We added CTEs so complex queries read top to bottom. We made it fast with indexes, and gave it transactions and ACID guarantees so half done work is never visible. We baked rules into the schema with constraints, so the data stays sane even when buggy apps try to mess it up.

Once that map is in your head, every database tutorial, blog post, and codebase starts to feel familiar. SQL stops feeling like a foreign language and starts feeling like a calm conversation with a very picky librarian who is, secretly, on your side.

Happy querying, and put a bookmark in chapter two for me.

DEV Community