Table of Contents
-
Designing Relational Schemas in SQLite — CS50 SQL Lecture 2 (Deep Notes + Examples)
- Table of Contents
- Why Schema Design Matters
- Peeking Under the Hood:
.schemaand Friends - Modeling the Real World: Entities & Relationships
- Normalization (1NF → 3NF) — A Beginner‑Friendly, No‑Hand‑Waving Guide
- Start: A denormalized log
- Key terms (plain English)
- What is a multi‑valued field and why split it?
- What are atomic columns / “each cell is atomic”?
- What are entities vs facts?
- 1NF — First Normal Form (the “atomic values” rule)
- 2NF — Second Normal Form (avoid partial dependency)
- 3NF — Third Normal Form (avoid transitive dependency)
- BCNF — Boyce–Codd Normal Form (intuitive view)
- Quick “am I normalized?” checklist
- Minimal working SQL (SQLite) for the subway example
- TL;DR
- SQLite Storage Classes vs. Type Affinities
- Creating Tables
- Keys: Primary, Foreign, and Composite
- Primary Keys (PK)
- Foreign Keys (FK)
- Composite Keys
- Column & Table Constraints
- Evolving a Schema:
ALTER TABLE&DROP TABLE - Design Gotchas & Practical Tips
- Cheat Sheet: Commands & Patterns
- Further Practice Ideas
Why Schema Design Matters
Your schema is the shape of your data. Good schemas reduce duplication, keep writes and reads reliable, and make “the right” queries easy to express. In this lecture, we move from using schemas to designing them: declaring tables, picking types, defining keys, and applying constraints that enforce the rules of your domain.
Peeking Under the Hood: .schema and Friends
SQLite gives you a quick way to inspect the structure of a database.
# Open a database
sqlite3 path/to/db.sqlite3
# Show the CREATE statements for the whole DB
.schema
# Or for just one table
.schema books
# Pretty-print query results
.headers on
.mode table
You can also quickly preview a table:
SELECT author, title
FROM longlist
LIMIT 5;
Modeling the Real World: Entities & Relationships
Typical entities you’ll meet:
-
Thing: e.g.,
Book,Author,Station,Card -
Relationship: e.g.,
Authorship,Visit/Swipe
A relationship can be:
- One-to-many (a publisher → many books)
- Many-to-many (authors ↔ books; riders ↔ stations via visits/swipes)
We represent many-to-many with a junction (aka join/associative) table:
Riders (id PK, name ...)
Stations (id PK, name, line ...)
Visits (id PK, rider_id FK → Riders.id, station_id FK → Stations.id)
Normalization (1NF → 3NF) — A Beginner‑Friendly, No‑Hand‑Waving Guide
Normalization reduces redundancy and prevents update anomalies. Below is a practical, example‑driven treatment using the “subway rides” story, rewritten to answer common beginner questions explicitly.
Start: A denormalized log
rides_log
--------------------------------------------------------------
| id | rider_name | station_name | action | fare | balance |
|----|------------|--------------|--------|------|----------|
| 1 | Charlie | Kendall/MIT | enter | 0.10 | 0.05 |
| 2 | Charlie | Jamaica Plain| exit | 0.05 | 0.00 |
| 3 | Alice | Harvard | enter | 0.10 | 0.20 |
| 4 | Alice | Park Street | exit | 0.05 | 0.15 |
Problems
- Rider names repeat; no way to distinguish two “Charlies.”
- Station names repeat; “Park Street” string appears everywhere.
- Updating a station’s metadata (e.g., line color) requires touching many rows.
- Rider‑specific facts (e.g., email) don’t belong in a station row, and vice versa.
Key terms (plain English)
What is a multi‑valued field and why split it?
A column that stores more than one value at once (e.g., "Red,Blue" or ["Red","Blue"]). In 1NF, each column holds one value per row.
Fix: Move the repeated values to a separate table and connect via a junction table.
What are atomic columns / “each cell is atomic”?
“Atomic” means one value, one meaning, not a bundle. If you often need a part of a value, it isn’t atomic for your use:
-
full_namemay be OK—unless you often needfirst_nameandlast_nameseparately. - Avoid lists/CSV/arrays/JSON blobs in a single cell.
What are entities vs facts?
-
Entities are the “nouns” you track over time (have identity):
Rider,Station. -
Facts are properties or events about entities: a visit/swipe that connects a rider and a station and records
action,fare,balance_after.
1NF — First Normal Form (the “atomic values” rule)
Rule of thumb:
- Every table has a primary key (unique identifier for each row).
- Each cell is atomic (no multi‑valued fields, no lists).
- No repeating groups inside a row.
From rides_log, split entities into their own tables and move the relationship into a fact table:
Riders (id PK, name)
Stations (id PK, name, line)
Visits (id PK, rider_id → Riders.id, station_id → Stations.id, action, fare, balance_after, ts)
Now each column holds a single value, tables have stable keys, and rider/station strings aren’t duplicated all over the log.
Tip: If you do need sets (e.g., a rider’s favorite lines), use a junction table:
CREATE TABLE lines( id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE ); CREATE TABLE rider_lines( rider_id INTEGER NOT NULL REFERENCES riders(id), line_id INTEGER NOT NULL REFERENCES lines(id), PRIMARY KEY (rider_id, line_id) );
2NF — Second Normal Form (avoid partial dependency)
When it applies: your table’s primary key is composite (e.g., (rider_id, station_id)).
Rule: Every non‑key column must depend on the whole key—not just part of it.
Why our Visits needs its own id:
If Visits used PK (rider_id, station_id), attributes like action, fare, balance_after, ts describe a single event, not the pair itself. They don’t depend on the entire key; they depend on the visit (an occurrence).
Fix: Give Visits its own surrogate key id (or rely on SQLite’s rowid), so every other column depends on that one key.
CREATE TABLE visits (
id INTEGER PRIMARY KEY,
rider_id INTEGER NOT NULL REFERENCES riders(id),
station_id INTEGER NOT NULL REFERENCES stations(id),
action TEXT NOT NULL CHECK (action IN ('enter','exit')),
fare NUMERIC NOT NULL CHECK (fare >= 0),
balance_after NUMERIC NOT NULL,
ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
3NF — Third Normal Form (avoid transitive dependency)
Rule: Non‑key attributes must not depend on other non‑key attributes (only on the key).
Example trap: Putting station_line into Visits:
-
station_linedepends onstation_id(a non‑key attribute) which depends onvisits.id(the key). That’s a transitive dependency. -
Fix: Keep
linesolely inStationsand join when needed.
-- Good: fetch station metadata via JOIN
SELECT v.id, r.name AS rider, s.name AS station, s.line, v.action, v.fare, v.ts
FROM visits v
JOIN riders r ON r.id = v.rider_id
JOIN stations s ON s.id = v.station_id;
BCNF — Boyce–Codd Normal Form (intuitive view)
Stricter than 3NF. For every non‑trivial dependency X → Y, X must be a key.
If some non‑key attribute (or set) functionally determines others, you likely need to decompose.
Classic BCNF violation:
courses(course_id PK, instructor, room)
Business rule: each instructor always teaches in one fixed room.
So: instructor → room, but instructor is NOT a key → violates BCNF.
Decompose:
CREATE TABLE instructors(
instructor TEXT PRIMARY KEY,
room TEXT NOT NULL
);
CREATE TABLE courses(
course_id INTEGER PRIMARY KEY,
instructor TEXT NOT NULL REFERENCES instructors(instructor)
);
Quick “am I normalized?” checklist
- Entities live in their own tables and have keys.
- Facts (events/relationships) live in junction/event tables.
- 1NF: No lists in cells; every table has a key.
- 2NF: With composite keys, every non‑key column depends on the whole key.
- 3NF: Non‑key columns don’t depend on other non‑key columns.
- BCNF (nice‑to‑have): If something determines other columns, it should be a key.
Red flags: CSV in a column, duplicate text everywhere, columns that can be derived from a FK target table, “shadow keys” (like station_name uniquely identifying a station but not declared as a key).
Minimal working SQL (SQLite) for the subway example
CREATE TABLE riders (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE stations (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL UNIQUE,
line TEXT NOT NULL
);
CREATE TABLE visits (
id INTEGER PRIMARY KEY,
rider_id INTEGER NOT NULL,
station_id INTEGER NOT NULL,
action TEXT NOT NULL CHECK (action IN ('enter','exit')),
fare NUMERIC NOT NULL CHECK (fare >= 0),
balance_after NUMERIC NOT NULL,
ts TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (rider_id) REFERENCES riders(id),
FOREIGN KEY (station_id) REFERENCES stations(id)
);
Query (join facts with entities):
SELECT v.id, r.name AS rider, s.name AS station, s.line, v.action, v.fare, v.balance_after, v.ts
FROM visits v
JOIN riders r ON r.id = v.rider_id
JOIN stations s ON s.id = v.station_id
ORDER BY v.ts;
Note: In production MBTA-like systems, “riders” are often replaced by cards and
visitsbecome swipes. The same normalization logic applies.
TL;DR
- 1NF: One value per cell; no lists; every table has a key.
-
2NF: With composite keys, non‑key columns must depend on the whole key (or just give the table a surrogate
id). - 3NF: Don’t keep attributes that are derivable from another table via a FK (avoid transitive dependencies).
- BCNF: If a set of columns determines others, that set should be a key—otherwise, decompose.
- Separate entities from facts, and join them when you need a complete picture.
SQLite Storage Classes vs. Type Affinities
Storage classes (per value):
-
NULL— absence of a value -
INTEGER— whole numbers (1, 2, …; width chosen by SQLite) -
REAL— floating-point -
TEXT— strings -
BLOB— raw bytes, stored exactly as given
Type affinities (per column):
-
TEXT,NUMERIC,INTEGER,REAL,BLOB
SQLite tries to coerce inserted values to the column’s affinity. Example:
-- TEXT affinity; values get stored as text
CREATE TABLE fares_text(amount TEXT);
INSERT INTO fares_text VALUES (10); -- stored as "10"
INSERT INTO fares_text VALUES ('25'); -- stored as "25"
-- INTEGER affinity; numeric-looking text coerces to INTEGER
CREATE TABLE fares_int(amount INTEGER);
INSERT INTO fares_int VALUES ('10'); -- stored as 10 (integer)
INSERT INTO fares_int VALUES (25); -- stored as 25
Money values: Prefer storing cents as INTEGER (
240= $2.40) to avoid floating-point rounding issues. Convert on display.
Creating Tables
CREATE TABLE riders (
id INTEGER,
name TEXT
);
Better with constraints and types (see later sections), but start simple and iterate.
Pro tip: Maintain a schema.sql you can .read into SQLite to recreate your DB from scratch.
sqlite3 mbta.db
.read schema.sql
.schema
Keys: Primary, Foreign, and Composite
Primary Keys (PK)
Uniquely identify rows. In SQLite, you can:
- Declare an explicit PK:
CREATE TABLE stations (
id INTEGER,
name TEXT NOT NULL UNIQUE,
line TEXT NOT NULL,
PRIMARY KEY (id)
);
- Or rely on the hidden
rowid(SQLite’s implicit PK) — but explicit is clearer.
Foreign Keys (FK)
Reference a PK in another table. Enable enforcement in SQLite:
PRAGMA foreign_keys = ON;
Example:
CREATE TABLE visits (
id INTEGER,
rider_id INTEGER,
station_id INTEGER,
PRIMARY KEY (id),
FOREIGN KEY (rider_id) REFERENCES riders(id),
FOREIGN KEY (station_id) REFERENCES stations(id)
);
Composite Keys
Sometimes the combination of columns is the key:
CREATE TABLE enrollment (
student_id INTEGER,
course_id INTEGER,
PRIMARY KEY (student_id, course_id)
);
Use composite keys when the pair is truly unique and repeated events don’t make sense. If events repeat (e.g., multiple swipes at the same station), use a surrogate id instead.
Column & Table Constraints
-
NOT NULL— required column -
UNIQUE— no duplicate values in the column -
CHECK (...)— custom rules -
DEFAULT <expr>— value to use when none provided -
PRIMARY KEY (...),FOREIGN KEY (...) REFERENCES ...— table constraints
Examples:
-- Stations that *must* have a name and line; names must be unique
CREATE TABLE stations (
id INTEGER,
name TEXT NOT NULL UNIQUE,
line TEXT NOT NULL,
PRIMARY KEY (id)
);
-- Visits with FKs
CREATE TABLE visits (
id INTEGER,
rider_id INTEGER,
station_id INTEGER,
action TEXT CHECK (action IN ('enter','exit')),
fare_cents INTEGER CHECK (fare_cents >= 0),
PRIMARY KEY (id),
FOREIGN KEY (rider_id) REFERENCES riders(id),
FOREIGN KEY (station_id) REFERENCES stations(id)
);
DEFAULT + timestamps:
CREATE TABLE events (
id INTEGER PRIMARY KEY,
happened NUMERIC NOT NULL DEFAULT CURRENT_TIMESTAMP
);
NUMERICaffinity works well forCURRENT_TIMESTAMPin SQLite.
Evolving a Schema: ALTER TABLE & DROP TABLE
You won’t get it perfect the first time. Change it safely:
-- Rename a table
ALTER TABLE visits RENAME TO swipes;
-- Add a column
ALTER TABLE swipes ADD COLUMN type TEXT;
-- Rename a column
ALTER TABLE swipes RENAME COLUMN type TO swipe_type;
-- Drop a column (SQLite ≥ 3.35.0)
ALTER TABLE swipes DROP COLUMN swipe_type;
-- Delete a table
DROP TABLE riders;
Prefer editing schema.sql and rebuilding during early design; use ALTER TABLE for production migrations.
Design Gotchas & Practical Tips
-
Booleans in SQLite: No native
BOOLEAN; useINTEGERwith0/1orTEXTwith'true'/'false'. -
Foreign keys are off by default:
PRAGMA foreign_keys = ON;per connection. - Money: Store cents as INTEGER. Convert to dollars at the edges.
-
Timestamps:
NUMERIC+DEFAULT CURRENT_TIMESTAMPis fine. If you need time zones, store UTC. -
Composite vs. surrogate PKs: Use composite keys when the pair naturally identifies a row and repeats don’t happen. Otherwise, prefer a surrogate
id. - Indexes: Add indexes on FKs/lookup columns when data grows:
CREATE INDEX idx_swipes_card ON swipes(card_id);
CREATE INDEX idx_swipes_station ON swipes(station_id);
-
Data validation: Push domain rules into
CHECKs where possible (e.g., sign ofamount_centsbased ontype). -
Migrations: Keep
schema.sqlunder version control; addseed.sqlfor small demo data. - Modes: For nicer output while exploring:
.headers on
.mode table
Cheat Sheet: Commands & Patterns
-- Inspect
.schema
.schema table_name
-- Create
CREATE TABLE name (...);
-- Keys & constraints (table-level)
PRIMARY KEY (col [, col2])
FOREIGN KEY (col) REFERENCES other(col)
-- Column constraints
NOT NULL
UNIQUE
CHECK (expr)
DEFAULT expr
-- Modify
ALTER TABLE t RENAME TO new_t;
ALTER TABLE t ADD COLUMN col TYPE [constraints];
ALTER TABLE t RENAME COLUMN old TO new;
ALTER TABLE t DROP COLUMN col; -- requires SQLite 3.35+
DROP TABLE t;
-- Enable FK enforcement
PRAGMA foreign_keys = ON;
Further Practice Ideas
-
Extend
cards: add astatuscolumn and aCHECKto block use whenstatus='blocked'. - Fare rules: Implement a peak/off-peak fare table; join in a trigger to pick the right fare at insert time.
-
Transfers: Model free transfers within N minutes with a
swipe_windowrule. -
Reporting queries:
- Top 10 busiest stations (by swipes) per day.
- Average deposit size per month.
- Cards with negative balance inserts (should be impossible if checks are correct!).
Top comments (0)