sql data types decide three things on every column you ever ship — how many bytes the row costs on disk, how fast the planner can compare values, and whether arithmetic returns the answer the business expects. Every senior data engineering review reaches for the type catalogue because the wrong choice at table-design time becomes a silent bug under load — INT overflow on a fast-growing event table, FLOAT rounding on a payment ledger, TIMESTAMP without time zone drifting across regions.
This guide is a working reference for the types of sql data types that show up in every schema: integers and decimals, fixed and variable-length strings, calendar and clock columns, and the semi-structured JSON / ARRAY / STRUCT families that modern warehouses lean on. Across five primitives — numeric, string, date-time, semi-structured, plus a domain-by-domain cheat sheet — we cover what each sql types family stores, when to reach for it, and the data datatype in sql pitfalls that catch experienced engineers. Each section pairs a teaching block with a worked example showing the code, a step-by-step trace, an output table, and a concept-by-concept why-this-works breakdown.
When you want hands-on reps immediately after reading, browse SQL practice library →, drill type-casting problems →, and rehearse on date-functions problems →.
On this page
- Why data types decide query speed and storage cost
- Numeric types — INT, NUMERIC, FLOAT, DOUBLE
- String types — CHAR, VARCHAR, TEXT, encoding
- Date and time types — DATE, TIME, TIMESTAMP, TIMESTAMPTZ, INTERVAL
- Semi-structured types — JSON, JSONB, ARRAY, STRUCT
- Cheat sheet — pick the right type by domain
- Frequently asked questions
- Practice on PipeCode
1. Why data types decide query speed and storage cost
Types are the only contract the database, the planner, and the application all share
The one-sentence invariant: a column's data type tells the storage engine how many bytes to reserve per row, tells the planner which comparison operator to use, and tells every downstream application what shape the value will take when it crosses the wire. Once you internalise that mental model — bytes first, semantics second — every sql data types decision collapses into a three-axis trade-off between storage, performance, and correctness.
Why "any type that fits" is the wrong instinct.
-
Storage. A
BIGINTis 8 bytes, anINTis 4. On a billion-row event table that is 4 GB of difference, plus the same again for every index that touches the column. -
Performance. Comparing two
NUMERIC(38,9)values is materially slower than comparing twoBIGINTvalues because the database must walk a variable-length representation. Multiply that by every join key and everyWHEREpredicate. -
Correctness.
FLOATcannot exactly represent0.1— the canonical demo is0.1 + 0.2 = 0.30000000000000004. Storing money inFLOATis a guaranteed audit-failure waiting for enough volume.
The three levers types control.
-
Storage layout. Fixed-width types (
INT,BIGINT,DATE,TIMESTAMP) sit at predictable offsets in the row — fast row scans, easy vectorisation. Variable-width types (VARCHAR,TEXT,NUMERIC,JSONB) include a length prefix or out-of-row pointer. -
Operator dispatch. Each type ships with its own equality, comparison, hash, and arithmetic operators. A
TEXT = INTcomparison either fails with a type error (strict dialects like Postgres) or quietly coerces one side to the other (MySQL, SQLite) — the latter is where silent bugs live. - Index eligibility. Some indexes only support fixed-width types; others (B-tree, GIN, GiST) handle variable-width but with different cost. Choosing a type implicitly picks which indexes are useful on the column.
Type-mismatch bugs that look like data bugs.
-
INToverflow. A signed 32-bitINTtops out at ~2.1 billion. A fast-moving event sequence will hit that ceiling and the next insert either errors out (Postgres) or wraps to a negative number (MySQL with strict-mode off). -
VARCHAR(N)truncation. MySQL withsql_modenot includingSTRICT_TRANS_TABLESwill silently truncate a 30-character string into aVARCHAR(20)column. The application sees no error; the report sees mangled data. -
FLOAT == FLOATtraps.WHERE amount = 0.30will fail to match a row whoseFLOATamountis actually0.30000000000000004. The bug looks like "missing data" until somebody traces the binary representation. -
Time zone drift. Storing a
TIMESTAMP(no time zone) and presenting it in two different application time zones produces two different wall-clock answers for the same row — a guaranteed bug as soon as your team crosses a continent.
Dialect variations that bite at migration time.
-
BOOLEAN— Postgres has a nativeboolean. MySQL stores it asTINYINT(1). Snowflake has a realBOOLEAN. Always check before porting. -
TEXTvsVARCHAR(MAX)— PostgresTEXTandVARCHAR(n)use the same storage; SQL Server distinguishesVARCHAR(MAX)fromVARCHAR(8000); Oracle usesCLOB. -
TIMESTAMPsemantics — PostgresTIMESTAMPhas no time zone;TIMESTAMPTZstores UTC. MySQLTIMESTAMPquietly converts to the session time zone on read;DATETIMEdoes not. Snowflake hasTIMESTAMP_NTZ,TIMESTAMP_LTZ,TIMESTAMP_TZ— three distinct semantics. -
JSONvsJSONB— Postgres has both;JSONBis the binary, indexable, almost-always-correct choice. MySQL hasJSON(also binary). Snowflake usesVARIANT. BigQuery usesJSONplus first-classSTRUCTandARRAY.
What interviewers and senior reviewers listen for.
- Do you reach for
NUMERIC(p, s)the moment money is mentioned? — senior signal. - Do you say "always store UTC, present in app TZ" when time zones come up? — required answer.
- Do you ask "what is the upper bound on this string?" before choosing
VARCHAR(n)vsTEXT? — required answer. - Do you mention
JSONB+ a GIN index when "flexible schema" comes up in Postgres? — current-default signal.
Worked example — type-driven storage cost on a billion-row event table
Detailed explanation. A realistic schema-design pause point is: "we expect a billion rows in this event table — how much do my column choices cost?" The answer reveals exactly how much weight types carry in a production warehouse. Below we walk a six-column event table where each column is chosen twice — once as a defensible production type, once as a casual choice that looks similar at design time but balloons on disk.
Question. Estimate the on-disk size of a billion-row event table under two schemas: production_schema (engineer-grade types) and casual_schema (everything wide). Then explain why the gap matters even before indexes.
Input.
| Column | production_schema | casual_schema |
|---|---|---|
| event_id | BIGINT (8 B) | NUMERIC(38) (~17 B) |
| user_id | INT (4 B) | BIGINT (8 B) |
| event_type | SMALLINT (2 B, lookup) | VARCHAR(64) (~24 B avg) |
| amount_cents | INT (4 B) | NUMERIC(19, 4) (~10 B) |
| event_ts | TIMESTAMPTZ (8 B) | TIMESTAMP (8 B) |
| payload | JSONB (~120 B avg) | JSON text (~180 B avg) |
Code (Postgres).
-- production schema
CREATE TABLE events_prod (
event_id BIGINT NOT NULL,
user_id INT NOT NULL,
event_type SMALLINT NOT NULL,
amount_cents INT NOT NULL,
event_ts TIMESTAMPTZ NOT NULL,
payload JSONB NOT NULL,
PRIMARY KEY (event_id)
);
-- casual schema
CREATE TABLE events_casual (
event_id NUMERIC(38) NOT NULL,
user_id BIGINT NOT NULL,
event_type VARCHAR(64) NOT NULL,
amount_cents NUMERIC(19, 4) NOT NULL,
event_ts TIMESTAMP NOT NULL,
payload JSON NOT NULL,
PRIMARY KEY (event_id)
);
Step-by-step explanation.
-
Per-row width. Sum the per-column byte estimates above.
production_schema≈ 146 B per row;casual_schema≈ 247 B per row. -
Row overhead. Add ~24 B of Postgres tuple header per row.
production_schema≈ 170 B;casual_schema≈ 271 B. -
Billion-row total (table only). Multiply by 1,000,000,000.
production_schema≈ 170 GB;casual_schema≈ 271 GB. A 100 GB difference per environment before indexes and replicas. -
Index amplification. A primary-key index on a
NUMERIC(38)column is ~2× the size of one on aBIGINT. Secondary indexes that touch the wide columns inherit the same penalty. - Cache density. A 24 KB Postgres page holds ~144 production-rows but only ~88 casual-rows. Every range scan touches ~1.6× more pages in the casual schema, which doubles cache pressure under read load.
Output.
| Lever | production_schema | casual_schema | Ratio |
|---|---|---|---|
| Per-row width (data + header) | 170 B | 271 B | 1.6× |
| 1 B rows (table only) | ~170 GB | ~271 GB | 1.6× |
| Primary-key index (rough) | ~20 GB | ~40 GB | 2.0× |
| Rows per 24 KB page | ~144 | ~88 | 0.6× |
Pages scanned for count(*)
|
~7 M | ~11 M | 1.6× |
Rule of thumb. At design time, treat type width like cost-per-row times row-count. A 100-byte mistake at a billion rows is 100 GB on disk, doubled across primary and replica clusters. Choose the narrowest type the domain allows — BIGINT over NUMERIC(38), INT over BIGINT if 2 B is enough, SMALLINT over INT for lookups.
SQL interview question on type selection at scale
A senior question in a schema-design round often sounds like: "We're modelling a payments event table that will hit a billion rows in the next 18 months. Walk me through how you'd pick the type for each column and what guardrails you'd add." It probes whether the candidate connects the types of sql data types to storage, planner cost, and operational pain in the same sentence.
Solution Using narrow exact types plus explicit constraints
CREATE TABLE payments (
payment_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id BIGINT NOT NULL,
merchant_id INT NOT NULL,
amount_cents BIGINT NOT NULL CHECK (amount_cents >= 0),
currency CHAR(3) NOT NULL, -- ISO-4217
status SMALLINT NOT NULL, -- FK to status lookup
occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
metadata JSONB NOT NULL DEFAULT '{}'::jsonb
);
CREATE INDEX payments_user_occurred_idx
ON payments (user_id, occurred_at DESC);
CREATE INDEX payments_metadata_gin_idx
ON payments USING GIN (metadata);
Step-by-step trace.
| Column | Choice | Why narrower won |
|---|---|---|
| payment_id | BIGINT identity | 8 B; survives 9 × 10^18 rows; planner uses native compare |
| user_id | BIGINT | room for global user IDs; matches the auth service width |
| merchant_id | INT | 4 B is enough — merchant count fits in 2 B already |
| amount_cents | BIGINT | cents avoid FLOAT; BIGINT (8 B) covers $9.2 × 10^16 |
| currency | CHAR(3) | fixed three-letter ISO code, zero waste |
| status | SMALLINT | lookup table; never more than a few hundred statuses |
| occurred_at | TIMESTAMPTZ | UTC under the hood; renders in app TZ |
| metadata | JSONB | binary, indexable, evolving |
Output:
| Metric | Value | Note |
|---|---|---|
| Per-row width (data + header) | ~92 B | tight; payload sits at ~50 B median |
| 1 B rows (table only) | ~92 GB | budget-friendly |
| Index footprint | ~30 GB | PK + composite + GIN |
| Read pattern | user + time | composite index serves user dashboards |
| Failure mode | overflow / truncation |
BIGINT and CHAR(3) both error cleanly |
Why this works — concept by concept:
-
Narrow exact types — every column uses the smallest exact type the domain allows.
BIGINTfor IDs and money-as-cents,INTfor foreign keys,SMALLINTfor enums,CHAR(3)for ISO codes. NoNUMERIC(38), no wideVARCHARunless the domain needs it. -
Cents over FLOAT — storing money as integer cents (or
NUMERIC(19, 4)if sub-cent precision matters) avoids IEEE-754 rounding. The application formats display values; the database stores facts. - TIMESTAMPTZ — Postgres stores UTC internally and renders in the session time zone. Same row, different time zones, consistent moment-in-time semantics.
- JSONB — binary storage with key parsing on write, indexable via GIN. The right choice for evolving payloads where most fields are not yet promoted to columns.
-
Constraints —
NOT NULL,CHECK (>=0), foreign-key references onmerchant_idandstatuskeep bad rows out before the application even sees them. - Cost — write latency = O(per-row fixed-width append); read latency = O(log N) on the composite index; storage = O(rows × 92 B + indexes).
Worked example — the implicit-cast trap that kills index usage
Detailed explanation. A subtle but expensive consequence of mismatched sql types is implicit casting at query time. When the application sends a string literal against an integer column, the planner has to coerce one side or the other. In many dialects (MySQL is the prototypical offender), the cast happens on the column — which means a B-tree index on that column is no longer usable. The query falls back to a sequential scan, and a 5 ms point lookup becomes a 5-second table sweep at scale.
Question. Given orders(user_id BIGINT, status SMALLINT) with a B-tree index on user_id, explain why WHERE user_id = '42' is fast in some dialects and slow in others, and write the fix that works everywhere.
Input.
| Dialect | Predicate | Behaviour |
|---|---|---|
| Postgres | WHERE user_id = '42' |
parses '42' as text; planner casts '42' to BIGINT; index used |
| MySQL | WHERE user_id = '42' |
same predicate; planner casts column to VARCHAR; index not used |
| SQL Server | WHERE user_id = '42' |
similar to MySQL — implicit cast can disable the index |
| Universal fix | WHERE user_id = 42 |
literal is already the right type; index used everywhere |
Code.
-- Slow path on MySQL — implicit column cast disables the index
EXPLAIN SELECT * FROM orders WHERE user_id = '42';
-- → type: ALL, rows: 1_000_000 (full table scan)
-- Fast path everywhere
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- → type: ref, rows: 8 (index lookup)
-- If the literal must be a string (e.g. from a parameterised ORM), cast it
EXPLAIN SELECT * FROM orders WHERE user_id = CAST('42' AS UNSIGNED);
-- → type: ref, rows: 8 (index lookup again)
Step-by-step explanation.
-
Predicate type matching. The planner walks
user_id = '42'and notices the types differ (BIGINTon the left,VARCHARon the right). Some dialects coerce the literal; some coerce the column. -
Column-side coercion is the killer. Wrapping the column in a cast (
CAST(user_id AS VARCHAR) = '42') makes the B-tree index inapplicable — the index is onBIGINTvalues, not on theirVARCHARprojection. - MySQL's particular behaviour. Historically, MySQL preferred to cast the integer column to a string when comparing against a string literal — which silently disabled the index. Modern MySQL is better, but the rule still bites in many setups.
-
The fix. Always send the literal in the column's type. ORMs typically do this if you pass
42(int) rather than"42"(string). Parameterised drivers do it when the bind parameter is declared as the right SQL type. -
Diagnostic move. When a query is mysteriously slow, run
EXPLAIN. Atype: ALLrow count equal to the table size is the implicit-cast smell.
Output.
| Query | Plan | Rows scanned |
|---|---|---|
WHERE user_id = '42' (MySQL) |
full table scan | ~1,000,000 |
WHERE user_id = 42 (every dialect) |
index lookup | ~8 |
WHERE user_id = CAST('42' AS UNSIGNED) |
index lookup | ~8 |
Rule of thumb. Strings bind to string columns; integers bind to integer columns. Never let your ORM or driver decide. When a query is slow, EXPLAIN first and check the predicate types second — implicit casts are the #1 reason a "well-indexed" query takes a second.
SQL
Topic — SQL
SQL schema-design and type-selection problems
2. Numeric types — INT, NUMERIC, FLOAT, DOUBLE
Numeric columns split into two families — exact integers and approximate floats — and a third (NUMERIC) that is "exact, slow, and unbounded"
The mental model in one line: integers and NUMERIC are exact, FLOAT and DOUBLE are approximate — the database does not let you mix them without telling you what was rounded. Once you know which family a column belongs to, every sql data types decision in the numeric column family collapses to two questions: "what's the range?" and "do I need exact arithmetic?"
The INT family in five bullets.
-
SMALLINT— 2 bytes, range-32,768to32,767. Lookups, small enums, weekday flags. Underused in modern schemas; reach for it when the domain genuinely fits. -
INT(a.k.a.INTEGER) — 4 bytes, range-2,147,483,648to2,147,483,647. The default for counts, foreign keys, anything under ~2.1 B. Falls over silently if your sequence outgrows it. -
BIGINT— 8 bytes, range~±9.2 × 10^18. The default for surrogate primary keys in any system that might ever exceed 2 B rows. Always useBIGINTfor event tables. -
SERIAL/BIGSERIAL(Postgres legacy) →GENERATED ALWAYS AS IDENTITYis the modern equivalent; safer, follows the SQL standard. -
Auto-promotion is not free. Migrating an
INTprimary key toBIGINTon a billion-row table is an offline operation in most production databases. ChooseBIGINTup front for surrogate keys.
NUMERIC(p, s) and DECIMAL — exact arithmetic.
-
NUMERIC(p, s)stores up toptotal digits withsdigits after the decimal point.NUMERIC(19, 4)covers all money values you will ever encounter in a single ledger. - Arithmetic is exact —
0.1 + 0.2 = 0.3exactly. No IEEE-754 rounding. - Storage is variable — Postgres uses ~2 bytes per 4 decimal digits plus overhead. A
NUMERIC(19, 4)averages ~10 bytes. - Performance is slower than fixed-width integers — the engine cannot vectorise variable-length comparisons as cheaply.
-
DECIMALis the SQL-standard synonym forNUMERICin most dialects. They are interchangeable.
FLOAT, DOUBLE PRECISION, REAL — approximate arithmetic.
- IEEE-754 single (
REAL/FLOAT(24), 4 bytes) and double (DOUBLE PRECISION/FLOAT(53), 8 bytes) precision. -
Never store money in
FLOATorDOUBLE.0.1 + 0.2 ≠ 0.3at the binary level; aggregate sums drift; audits fail. - Good for sensor telemetry, scientific data, analytics where a known relative tolerance is acceptable.
- Equality on
FLOATis essentially never the right operator. Useabs(a - b) < epsiloninstead.
BOOLEAN — small, special, dialect-dependent.
-
Postgres: native 1-byte
booleanwithtrue/false/nullsemantics. -
MySQL: there is no real
BOOLEAN.BOOLandBOOLEANare aliases forTINYINT(1)— 0 or 1. -
Snowflake: native
BOOLEAN. -
SQL Server: no
BOOLEAN; useBIT. -
BigQuery: native
BOOL. -
Portability tip: when modelling for cross-dialect deployment, use a
BOOLEAN-named column and let the migration tool map it; don't hard-codeTINYINT(1).
Choosing between BIGINT and NUMERIC for counters.
-
Counter that increments by 1 —
BIGINT. 8 bytes, fast compare, no rounding ever. -
Counter that accumulates fractional amounts —
NUMERIC(p, s). Decidepbased on the lifetime sum,sbased on the smallest unit you have to track. -
Counter that holds aggregate sensor totals —
BIGINTif you can keep the unit small enough (microseconds, nanograms);NUMERICif you need arbitrary fractional precision.
Worked example — pick the right numeric type for a payments ledger
Detailed explanation. A payments ledger is the canonical exact-arithmetic surface. The natural columns are amount (the transaction value), fee_amount (a derived value), and running_balance (a per-account aggregate). The right type for each is NUMERIC(19, 4) — exact, big enough for global ledgers, small enough to index efficiently. The wrong type for any of them is FLOAT, and the demo below shows exactly why.
Question. Two engineers propose competing types for a payments table — Engineer A uses NUMERIC(19, 4), Engineer B uses DOUBLE PRECISION. After 100 microtransactions of $0.10 each, what does each schema report for SUM(amount), and which one matches the auditor's expectation of $10.00?
Input.
| Engineer | Column | Type | Inserted value |
|---|---|---|---|
| A | amount | NUMERIC(19, 4) | 0.1000 × 100 rows |
| B | amount | DOUBLE PRECISION | 0.1 × 100 rows |
Code (Postgres).
-- exact ledger
CREATE TABLE ledger_a (id BIGINT PRIMARY KEY, amount NUMERIC(19, 4));
INSERT INTO ledger_a SELECT g, 0.10 FROM generate_series(1, 100) g;
-- float ledger
CREATE TABLE ledger_b (id BIGINT PRIMARY KEY, amount DOUBLE PRECISION);
INSERT INTO ledger_b SELECT g, 0.10 FROM generate_series(1, 100) g;
SELECT 'A' AS who, SUM(amount) AS total FROM ledger_a
UNION ALL
SELECT 'B', SUM(amount) FROM ledger_b;
Step-by-step explanation.
-
Insert. Each row inserts
0.10. Ledger A stores the exact decimal0.1000; ledger B stores the closestDOUBLEto0.1, which is0.1000000000000000055511151231257827021181583404541015625. -
Sum. Ledger A's
SUMadds 100 exact0.1000values →10.0000. Ledger B'sSUMadds 100 approximate values; the rounding errors accumulate. -
Compare. Ledger A reports
10.0000. Ledger B reports something like10.000000000000002— visibly wrong to the auditor. -
Why does this matter? The application probably formats both as
$10.00on screen. But the underlying value in ledger B is not$10.00— it is "as close to$10.00asDOUBLEcan express." Any equality check (WHERE total = 10.00) will fail; any aggregation across many such sums will drift further.
Output.
| Ledger | Column type | SUM(amount) | Audit pass? |
|---|---|---|---|
| A | NUMERIC(19, 4) | 10.0000 | ✓ |
| B | DOUBLE PRECISION | ~10.000000000000002 | ✗ |
Rule of thumb. Money is NUMERIC(p, s). The p and s are set by the largest expected aggregate and the smallest unit you track. For most consumer applications NUMERIC(19, 4) is the right answer (15 digits before the decimal, 4 after). For high-frequency trading, push s to 8.
Worked example — SMALLINT for a status enum saves real bytes at scale
Detailed explanation. A status column on a high-volume table — order status, event status, request status — is often modelled as a VARCHAR(20) containing values like pending, confirmed, shipped. The right type, in almost every dialect, is SMALLINT plus a lookup table. The string form costs ~24 bytes per row for the average 8-character status (4 B length-prefix + content + alignment); the SMALLINT form costs 2 bytes. On a billion-row table that is 22 GB saved before any index touches the column.
Question. Compare three ways to model an order status: VARCHAR(20), CHAR(10), and SMALLINT with a status_lookup table. Show the storage and the query cost for "give me the order count by status."
Input (the three schemas).
| Schema | Column | Type | Bytes per row |
|---|---|---|---|
| A | status | VARCHAR(20) | ~12 B (avg "shipped" = 7 chars + header) |
| B | status | CHAR(10) | 10 B (always; padded) |
| C | status_id | SMALLINT | 2 B |
Code.
-- Schema C — lookup table + SMALLINT FK
CREATE TABLE order_status_lookup (
status_id SMALLINT PRIMARY KEY,
status_name TEXT NOT NULL UNIQUE
);
INSERT INTO order_status_lookup VALUES
(1, 'pending'), (2, 'confirmed'), (3, 'shipped'),
(4, 'delivered'), (5, 'cancelled');
CREATE TABLE orders_c (
order_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
status_id SMALLINT NOT NULL REFERENCES order_status_lookup(status_id),
placed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Count by status (Schema C)
SELECT s.status_name, COUNT(*) AS n
FROM orders_c o
JOIN order_status_lookup s USING (status_id)
GROUP BY s.status_name
ORDER BY n DESC;
Step-by-step explanation.
- Lookup table. Five rows; ~50 bytes total. Lives in the buffer cache forever; the join is effectively free.
-
SMALLINT FK. Each
orders_crow stores a 2-byte integer instead of an 8-12 byte string. On a billion rows that is 6-10 GB saved on the table alone. -
Index footprint. A B-tree index on
status_idis also 2 bytes per entry plus overhead. A B-tree onVARCHAR(20)is 8-12 bytes per entry plus the same overhead. Pull-through savings are 3-6×. -
Query shape. The
JOINto the lookup table is a hash join with five entries on the build side — it does not show up inEXPLAIN ANALYZEtime profiles. - Display friendliness. Reports and dashboards see the textual status because of the join. The wire format stays human-readable; the storage stays compact.
Output (billion-row estimate, table only).
| Schema | Bytes per row | Total (1 B rows) | Index (rough) |
|---|---|---|---|
| A — VARCHAR(20) | ~12 | ~12 GB | ~14 GB |
| B — CHAR(10) | 10 | ~10 GB | ~12 GB |
| C — SMALLINT + lookup | 2 + lookup ~50 B | ~2 GB | ~3 GB |
Rule of thumb. Whenever a column has a closed, small set of values (≤ ~32k), reach for SMALLINT plus a lookup table. The storage is 4-6× tighter, the index is just as small, and the lookup table makes the enum self-documenting in the database itself.
SQL interview question on choosing the integer width
A common probe in a schema-review round sounds like: "We have a fast-growing event table — how do you pick between INT and BIGINT for the primary key?" It tests whether the candidate connects the column type to the lifetime of the table, not just the row count today.
Solution Using BIGINT identity with explicit overflow guard
CREATE TABLE events (
event_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id BIGINT NOT NULL,
occurred TIMESTAMPTZ NOT NULL DEFAULT now(),
payload JSONB NOT NULL
);
-- Cheap overflow probe — runs in O(1)
SELECT
pg_sequence_last_value('events_event_id_seq') AS current_id,
9223372036854775807 AS bigint_max,
9223372036854775807
- pg_sequence_last_value('events_event_id_seq') AS headroom;
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | BIGINT GENERATED ALWAYS AS IDENTITY |
8-byte width; covers 9.2 × 10^18 rows; modern SQL-standard syntax |
| 2 | Each insert | sequence advances by 1; row gets the next ID, atomic across sessions |
| 3 | At 2 B rows | an INT schema would have already overflowed; BIGINT reports 2,000,000,000 with ~9.2 × 10^18 of headroom |
| 4 | pg_sequence_last_value |
constant-time check; safe to wire into a monitoring dashboard |
| 5 | At 9 × 10^18 rows | still healthy; in practice you will retire the table long before this |
Output:
| Column | Type | Width | Range | Safe-for |
|---|---|---|---|---|
| event_id | BIGINT | 8 B | ~±9.2 × 10^18 | event tables, audit logs |
| user_id | BIGINT | 8 B | global ID space | matches auth service |
| occurred | TIMESTAMPTZ | 8 B | year 4713 BC to 294276 AD | every transaction |
| payload | JSONB | variable | binary JSON | indexable via GIN |
Why this works — concept by concept:
-
BIGINT for surrogate keys — 8 bytes is the cheapest type that survives the lifetime of any high-throughput table. Migrating from
INTtoBIGINTafter the fact is an offline operation in most production databases. -
GENERATED ALWAYS AS IDENTITY — the SQL-standard replacement for
SERIAL. Locked-down by default (cannot be overridden byINSERT … VALUES (id, …)), which prevents accidental insertions that skip the sequence. -
Headroom monitoring —
pg_sequence_last_valuelets ops paint a single number on a dashboard: "current ID / max ID." Linear extrapolation gives a calendar date for the overflow event, decades out forBIGINT. -
Width discipline — every fixed-width 8-byte column lines up cleanly in the row, which keeps the page format efficient.
NUMERIC(38)would have varied per row and inflated index size by ~2×. -
Cost — insert latency = O(1) sequence advance; storage =
8 B per row + index entry; planner uses native integer compare on every join.
SQL
Topic — type-casting
Type-casting problems (SQL)
3. String types — CHAR, VARCHAR, TEXT, encoding
Strings split into fixed-width, length-bounded, and unbounded — pick by predictability, not by habit
The mental model in one line: CHAR(N) is fixed-width with space padding, VARCHAR(N) is variable with a hard upper bound, TEXT is variable with no upper bound — and in modern Postgres they all share the same storage path. Once you internalise that the sql types for strings differ in semantics (the length check), not in storage (the bytes on disk), the noisy "VARCHAR(255) is faster than TEXT" arguments collapse to dialect-specific footnotes.
CHAR(N) — fixed-width, space-padded.
- Always stores exactly
Ncharacters; pads with trailing spaces if the value is shorter. - Useful when the domain is truly fixed — ISO currency codes (
CHAR(3)), US state codes (CHAR(2)), fixed-width hashes (CHAR(64)for SHA-256 hex). -
Pitfall: comparison ignores trailing spaces in some dialects (
'US' = 'US 'is true in SQL ServerCHAR), but inequality andLIKEmay not. Always document the semantics. - Use sparingly. In 2026 schemas,
CHARis rare outside genuinely fixed-width codes.
VARCHAR(N) — variable, length-checked.
- Stores up to
Ncharacters; uses one or two length-prefix bytes plus the actual data. -
The length is a constraint, not a storage hint. Postgres reserves the same space for
VARCHAR(255)andVARCHAR(1000)— both grow as the data needs. MySQL, in contrast, treats the declared length as part of the row width up to 65,535 bytes (the row-size limit). -
VARCHAR(255)myth. In Postgres there is no performance gain fromVARCHAR(255)vsVARCHAR(1000)vsTEXT. The myth comes from the MySQL row-width limit, where shorterVARCHARs leave room for more columns. In Postgres,TEXTis the safe default. -
Choose
Nfrom the domain. If the maximum legal email length is 320, declareVARCHAR(320)— neverVARCHAR(255)"because everyone does." The length constraint catches bad input at the database boundary.
TEXT — variable, unbounded.
- No declared upper bound. Postgres stores small
TEXTinline and largeTEXTout-of-row in TOAST (The Oversized-Attribute Storage Technique). - Same storage path as
VARCHARin Postgres —TEXTis the idiomatic Postgres default for any string column. - SQL Server:
TEXTis deprecated; useVARCHAR(MAX)orNVARCHAR(MAX). - Oracle:
CLOBfor character large objects; very different storage semantics. - MySQL:
TEXT,MEDIUMTEXT,LONGTEXTfor progressively larger limits (64 KB / 16 MB / 4 GB).
Encoding — UTF-8 is the default; the pitfalls are real.
-
UTF-8 is variable-byte. Each character takes 1 to 4 bytes. A
VARCHAR(255)can hold 255 characters, not bytes — but the row-byte limit in MySQL still applies (so autf8mb4VARCHAR(255)reserves255 × 4 = 1020row bytes for size calculations). -
utf8≠utf8mb4in MySQL. Historicallyutf8in MySQL was a 3-byte subset that could not store emojis or many CJK characters. Always useutf8mb4(the 4-byte variant) —utf8is essentially a legacy alias. -
Collation matters.
utf8mb4_general_ciis case-insensitive and accent-insensitive;utf8mb4_binis byte-wise. Sort order andWHERE name = 'café'semantics depend on the collation. Always pick deliberately. -
Silent truncation. With MySQL
sql_modenot includingSTRICT_TRANS_TABLES, inserting a string longer than the column will silently truncate. Postgres always errors. SQL Server errors unlessANSI_WARNINGS OFF.
TEXT vs VARCHAR(MAX) vs CLOB — dialect cheat sheet.
-
Postgres —
TEXTis canonical. UseVARCHAR(N)only whenNis genuinely a constraint. -
MySQL —
VARCHAR(N)for bounded strings (up to ~21,000 characters withutf8mb4and a near-empty row),TEXT/MEDIUMTEXT/LONGTEXTfor unbounded. -
SQL Server —
VARCHAR(N)up to 8000,VARCHAR(MAX)for larger. UseNVARCHARif you need full Unicode (UTF-16). -
Oracle —
VARCHAR2(N)for ≤ 4000 bytes (or 32767 withMAX_STRING_SIZE=EXTENDED),CLOBfor larger. -
Snowflake —
VARCHAR(N)andSTRINGare aliases; both are unlimited;Nis enforced as a length constraint only. -
BigQuery —
STRING(unlimited) andBYTES; noVARCHAR.
Worked example — pick string types for a user-profile table
Detailed explanation. A user-profile table mixes truly fixed-width fields (country_code), bounded fields (email), and unbounded fields (bio). The right type per column is determined by the domain — not by tradition.
Question. Design a users table in Postgres that stores a 3-letter country code, an email address (max 320 chars per RFC 5321), a screen name (max 50 chars, enforced at the API), and a free-form bio of any length. Show the schema, then a row that exercises each column.
Input.
| Column | Domain | Right type |
|---|---|---|
| country_code | ISO-3166 alpha-3 (always 3 letters) | CHAR(3) |
| RFC 5321 max 320 | VARCHAR(320) | |
| screen_name | product rule: ≤ 50 chars | VARCHAR(50) |
| bio | free-form, can hold emoji | TEXT |
Code.
CREATE TABLE users (
user_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
country_code CHAR(3) NOT NULL,
email VARCHAR(320) NOT NULL UNIQUE,
screen_name VARCHAR(50) NOT NULL,
bio TEXT NOT NULL DEFAULT ''
);
INSERT INTO users (country_code, email, screen_name, bio)
VALUES
('USA', 'alice@example.com', 'alice', 'data engineer · ❤ sql'),
('IND', 'bob@example.com', 'bob42', ''),
('GBR', 'carol@example.com', 'carol-c', 'I write about systems.');
Step-by-step explanation.
-
country_code CHAR(3)— fixed-width because every ISO-3166 alpha-3 code is exactly 3 letters. Storage is 3 bytes plus a tiny overhead. Reads always return exactly 3 characters; no padding surprises because the domain is already 3. -
email VARCHAR(320)— bounded by the RFC. The database enforces the upper bound, which catches bad input at the boundary rather than in the application. UTF-8 friendly (most emails are ASCII; the column also handles internationalized addresses). -
screen_name VARCHAR(50)— bounded by the product rule. If the rule changes, the migration is a singleALTER COLUMN; until then, no surprises in the schema. -
bio TEXT— unbounded, supports emoji, handled identically toVARCHAR(50)under the hood in Postgres. Small values stay inline; very long values are TOASTed out-of-row. -
Insert exercise — three rows demonstrate that
biocan hold the emoji❤(UTF-8 multi-byte) without any column-width juggling.
Output.
| user_id | country_code | screen_name | bio | |
|---|---|---|---|---|
| 1 | USA | alice@example.com | alice | data engineer · ❤ sql |
| 2 | IND | bob@example.com | bob42 | |
| 3 | GBR | carol@example.com | carol-c | I write about systems. |
Rule of thumb. Use CHAR(N) only when the domain is truly fixed-width. Use VARCHAR(N) to encode a real domain constraint that you want the database to enforce. Use TEXT for genuinely unbounded strings. Never use VARCHAR(255) as a default just because the textbook example used it.
Worked example — MySQL silent truncation vs Postgres strict failure
Detailed explanation. A team migrates a VARCHAR(20) column from MySQL to Postgres. The source data has a few rows where a typo extended the value to 30 characters. MySQL accepted them with silent truncation; Postgres refuses, raising value too long for type character varying(20). This is the canonical "dialect bites you" story for data datatype in sql sizing — and the right fix is rarely "make the column wider," it is "decide what the domain really is."
Question. Given a VARCHAR(20) column receiving a 30-character value, show what MySQL with sql_mode not including STRICT_TRANS_TABLES does, what Postgres does, and the right way to handle the cross-dialect migration.
Input.
| Source | Column | Value being inserted | Length |
|---|---|---|---|
| MySQL (sql_mode lax) | name VARCHAR(20) | 'Alessandra-Maria Rodriguez' | 27 |
| Postgres | name VARCHAR(20) | 'Alessandra-Maria Rodriguez' | 27 |
Code.
-- MySQL — silent truncation when sql_mode is lax
CREATE TABLE customers_my (id BIGINT PRIMARY KEY, name VARCHAR(20));
INSERT INTO customers_my VALUES (1, 'Alessandra-Maria Rodriguez');
SELECT id, name, length(name) FROM customers_my;
-- → (1, 'Alessandra-Maria Rod', 20) ⚠️ silently truncated
-- Postgres — clean error
CREATE TABLE customers_pg (id BIGINT PRIMARY KEY, name VARCHAR(20));
INSERT INTO customers_pg VALUES (1, 'Alessandra-Maria Rodriguez');
-- → ERROR: value too long for type character varying(20)
Step-by-step explanation.
-
MySQL lax mode. Without
STRICT_TRANS_TABLES, the server treats the excess characters as "a warning, not an error." The truncated value is committed. The application receives a successfulINSERTand never sees the warning unless it pollsSHOW WARNINGS. -
Postgres strict-always. Postgres raises
22001 string_data_right_truncationand rolls back the transaction. The application sees the failure immediately. - Migration discovery. The migration tool pulls one row at a time and inserts into Postgres; the first oversized row fails the entire batch. The team now sees the data is wider than the schema claims.
-
The right fix. Two options. Either the domain genuinely needs more room (then
ALTER COLUMN name TYPE VARCHAR(64)or move toTEXT), or the wider values were typos that should have been caught at the source (then clean the data and add an application-side check). -
The wrong fix. "Just remove the length constraint" — turning every
VARCHAR(20)intoTEXTremoves the database's ability to catch bad input at the boundary. The fix should be domain-driven, not error-driven.
Output.
| Database | Insert behaviour | Stored value | App observation |
|---|---|---|---|
| MySQL (lax) | succeeds with warning | 'Alessandra-Maria Rod' | silent data loss |
| MySQL (STRICT) | errors | (no insert) | clear failure |
| Postgres | errors | (no insert) | clear failure |
Rule of thumb. Run MySQL with sql_mode including STRICT_TRANS_TABLES. Always. The data you don't catch at the boundary is the data that shows up in a Monday-morning incident.
SQL interview question on string-type pitfalls
A senior probe in a schema review often sounds like: "Why is VARCHAR(50) the wrong default for a column that holds product descriptions across English, Spanish, and Japanese?" The interviewer is checking whether the candidate knows that UTF-8 length is in characters (or bytes, depending on dialect), and that fifty characters in Japanese is more storage than fifty in English.
Solution Using TEXT plus an application-side validator and a CHECK constraint
CREATE TABLE products (
product_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
sku CHAR(8) NOT NULL UNIQUE,
description TEXT NOT NULL,
CHECK (char_length(description) BETWEEN 1 AND 500)
);
INSERT INTO products (sku, description) VALUES
('SKU00001', 'Bluetooth headphones with noise cancellation'),
('SKU00002', 'Auriculares Bluetooth con cancelación de ruido'),
('SKU00003', 'ノイズキャンセリング機能付き Bluetooth ヘッドフォン');
Step-by-step trace.
| Row | Language | char_length(description) | byte_length | Stored in TEXT? |
|---|---|---|---|---|
| 1 | English | 47 | 47 | ✓ |
| 2 | Spanish | 46 | 47 (one accented char ≈ 2 B) | ✓ |
| 3 | Japanese | 30 | 70 (each kanji ≈ 3 B in UTF-8) | ✓ |
Output:
| product_id | sku | description |
|---|---|---|
| 1 | SKU00001 | Bluetooth headphones with noise cancellation |
| 2 | SKU00002 | Auriculares Bluetooth con cancelación de ruido |
| 3 | SKU00003 | ノイズキャンセリング機能付き Bluetooth ヘッドフォン |
Why this works — concept by concept:
-
TEXT in Postgres — same storage path as
VARCHAR, no need to pick a magic number for the byte-width of the longest realistic value. The CHECK constraint enforces a character-count bound that is locale-aware. -
char_length vs octet_length —
char_lengthreturns the number of code points;octet_lengthreturns the number of bytes. Constraints should usually be expressed in characters so the same rule applies regardless of language. -
CHAR(8) for SKUs — fixed-width because every SKU is exactly eight characters. Indexes on
CHAR(N)are slightly more compact than onVARCHAR(N). - UTF-8 awareness — UTF-8 stores ASCII in 1 byte, Latin extended in 2, BMP in 3, supplementary plane (including most emoji) in 4. The schema never has to know — but you do, when sizing storage budgets.
-
Cost — storage =
9 B (header) + actual UTF-8 bytesper row; index size depends on the language mix in real data.
SQL
Topic — string functions
String manipulation problems (SQL)
4. Date and time types — DATE, TIME, TIMESTAMP, TIMESTAMPTZ, INTERVAL
data datatype in sql for time is five distinct primitives — pick by what the column represents, not what the wire format looks like
The mental model in one line: a DATE is a calendar day, a TIME is a wall-clock, a TIMESTAMP is a calendar day plus a wall-clock with no zone, a TIMESTAMPTZ is a fully-specified moment in UTC, and an INTERVAL is a duration with no anchor. Once you can name what each one means, every "store local time? UTC? both?" debate collapses into a single rule: store UTC, present in app TZ.
DATE — a calendar day, no time.
- 4 bytes in Postgres; 3 in MySQL. Range from
4713-01-01 BCto5874897-12-31 ADin Postgres. - The right type for birthdays, expiration dates, "the day the order was placed," and any value where the time of day is genuinely unknown or irrelevant.
- No time-zone semantics — a
DATEis the same calendar value everywhere on Earth.
TIME (TIME WITHOUT TIME ZONE) — a wall-clock, no date.
- 8 bytes in Postgres. Range
00:00:00to24:00:00. - Useful for "store opens at 09:00" — a daily schedule that does not anchor to a specific calendar day.
- Rare in modern schemas; usually you want a
TIMESTAMPfor the actual event and aTIMEonly for the recurring pattern.
TIMESTAMP (TIMESTAMP WITHOUT TIME ZONE) — date + time, no zone.
- 8 bytes. Stores the value literally as written; no conversion on read or write.
-
Almost always the wrong choice for production data. Without a zone, the same column value (
2026-06-02 09:00:00) could mean "9am in New York" or "9am in Tokyo" — the database cannot tell. - Useful only when the value is genuinely zoneless — for example, "this report covers the working hours 09:00-17:00 in whichever timezone the user is in."
TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) — the production default.
- 8 bytes. Postgres stores it as UTC internally; renders it in the session time zone on read.
- The right type for "when did the event happen?" — a single global moment that can be presented in any time zone.
-
MySQL's
TIMESTAMP≠ Postgres'TIMESTAMP. MySQL'sTIMESTAMPalready converts to UTC on write and back to the session zone on read; MySQL'sDATETIMEis the literal "no zone" value. Mind the dialect. -
Snowflake has
TIMESTAMP_NTZ(no zone),TIMESTAMP_LTZ(local — same idea as Postgres' TZTZ), andTIMESTAMP_TZ(literal stored zone).TIMESTAMP_LTZis the closest equivalent to Postgres'TIMESTAMPTZfor moment-in-time semantics.
INTERVAL — the underrated type (Postgres-only in this strong form).
- Stores a duration:
INTERVAL '7 days 3 hours',INTERVAL '90 seconds',INTERVAL '1 year'. - 16 bytes in Postgres. Supports natural arithmetic:
now() + INTERVAL '7 days',event_ts - other_event_ts. -
INTERVAL '1 day'≠INTERVAL '24 hours'— across a DST transition, "1 day" is 23 or 25 hours of wall-clock time, while "24 hours" is always 24 elapsed hours. -
Not portable. MySQL has
INTERVALonly as a literal in expressions, not as a column type. For cross-dialect deployment, store durations asBIGINTseconds (orBIGINTmilliseconds) and reconstruct.
Epoch / UNIX timestamp — when to use BIGINT instead.
- A
BIGINTUNIX-epoch second (or millisecond) value works in every dialect, never loses information across migrations, and indexes as a plain integer. - Trade-off: humans cannot read it. Always pair with a view or computed column that exposes
to_timestamp(epoch_seconds). - Use
BIGINTepoch when the storage layer is a NoSQL-flavoured warehouse that does not have first-classTIMESTAMPTZ, or when the row will be replicated across dialects.
Time-zone traps in five bullets.
- UTC vs local vs app TZ. Always store UTC; convert at the application edge; do not let any layer in the middle "helpfully" reinterpret.
-
DST 23/25-hour days. Daylight-saving transitions produce 23-hour and 25-hour calendar days.
INTERVAL '1 day'follows the calendar;INTERVAL '24 hours'follows the clock. Choose deliberately. -
Client-side TZ inference is fragile. A JavaScript
new Date()reads the OS time zone. Two users in two cities looking at the same record see two different rendered times. Always render server-side or pass an explicit zone. -
AT TIME ZONE. The SQL-standard operator:event_ts AT TIME ZONE 'UTC'converts aTIMESTAMPto a moment;event_ts AT TIME ZONE 'America/New_York'converts aTIMESTAMPTZback to a local wall-clock for display. - Leap seconds. Most databases (including Postgres) treat the day as exactly 86,400 seconds and "swallow" leap seconds by smearing or repeating. Almost never matters; if it does, you already know.
Worked example — design an events table that survives DST and multi-region reads
Detailed explanation. Two teams in two cities log to the same events table; users in three time zones read dashboards. The only sustainable design is: store UTC in a TIMESTAMPTZ, present in the user's preferred zone at render time. The demo below shows what the same row looks like under New York, London, and Tokyo presentations.
Question. A user in America/New_York posts a comment at "9:00 AM local" on 2026-06-02. Write an INSERT that stores the moment correctly in TIMESTAMPTZ, then write three SELECTs that render the same row as a wall-clock in New York, London, and Tokyo.
Input.
| User locale | Wall-clock | UTC equivalent |
|---|---|---|
| America/New_York (EDT, UTC-4) | 2026-06-02 09:00:00 | 2026-06-02 13:00:00 |
Code.
CREATE TABLE comments (
comment_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
body TEXT NOT NULL,
posted_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Insert as a literal in the user's zone — the DB converts to UTC
INSERT INTO comments (body, posted_at)
VALUES ('Hello from NYC',
TIMESTAMP '2026-06-02 09:00:00' AT TIME ZONE 'America/New_York');
-- Read in three different zones
SELECT body,
posted_at AT TIME ZONE 'America/New_York' AS nyc_wall_clock,
posted_at AT TIME ZONE 'Europe/London' AS london_wall_clock,
posted_at AT TIME ZONE 'Asia/Tokyo' AS tokyo_wall_clock
FROM comments;
Step-by-step explanation.
-
Insert. The literal
2026-06-02 09:00:00is interpreted in theAmerica/New_Yorkzone viaAT TIME ZONE. Postgres converts to UTC (subtract 4 hours during EDT) and stores2026-06-02 13:00:00+00. - Storage. On disk the value is 8 bytes representing the UTC moment. The original "9am NYC" framing is not stored — the moment is.
-
Read in NYC.
AT TIME ZONE 'America/New_York'converts UTC back to local wall-clock. Reader sees2026-06-02 09:00:00. -
Read in London. Same UTC moment, different rendering. EDT is UTC-4; BST is UTC+1; so the London wall-clock is
2026-06-02 14:00:00. -
Read in Tokyo. JST is UTC+9; the Tokyo wall-clock is
2026-06-02 22:00:00. -
DST robustness. Six months later (December) the same query in NYC will render
4:00 AM-ish if the same UTC instant is replayed — because EST is UTC-5, not UTC-4. The database doesn't care; the application reasoning stays the same.
Output.
| body | nyc_wall_clock | london_wall_clock | tokyo_wall_clock |
|---|---|---|---|
| Hello from NYC | 2026-06-02 09:00:00 | 2026-06-02 14:00:00 | 2026-06-02 22:00:00 |
Rule of thumb. TIMESTAMPTZ for every event-time column. Reads use AT TIME ZONE '<user-preferred-zone>'. Never store local time without an explicit zone column — and even then, prefer TIMESTAMPTZ and present locally.
Worked example — INTERVAL arithmetic vs raw seconds across DST
Detailed explanation. A scheduled job needs to add "one day" to a TIMESTAMPTZ. Across the spring-forward DST transition, INTERVAL '1 day' advances the calendar but only ~23 hours of wall-clock time; INTERVAL '24 hours' advances exactly 24 hours but lands on a different calendar day. Pick the wrong one and the job either runs an hour early, an hour late, or doubles up on the transition day.
Question. Given t = 2026-03-08 01:30:00-05 (Saturday night, just before US EDT begins at 2:00 AM on Sunday), compute t + INTERVAL '1 day' and t + INTERVAL '24 hours' and show how they differ.
Input.
| Variable | Value |
|---|---|
| t | 2026-03-08 01:30:00-05 (EST, UTC-5) |
| DST transition | 2026-03-08 02:00:00 EST → 03:00:00 EDT |
Code (Postgres).
WITH base AS (
SELECT TIMESTAMPTZ '2026-03-08 01:30:00 America/New_York' AS t
)
SELECT
t AS original,
t + INTERVAL '1 day' AS plus_one_day,
t + INTERVAL '24 hours' AS plus_24_hours,
(t + INTERVAL '1 day') AT TIME ZONE 'America/New_York' AS plus_one_day_nyc,
(t + INTERVAL '24 hours') AT TIME ZONE 'America/New_York' AS plus_24h_nyc
FROM base;
Step-by-step explanation.
-
Original moment.
2026-03-08 01:30:00 America/New_Yorkis EST (UTC-5), so the underlying UTC instant is2026-03-08 06:30:00+00. -
+ INTERVAL '1 day'. Postgres advances the calendar day by one. The result is2026-03-09 01:30:00 America/New_York, which in EDT (UTC-4) is the UTC instant2026-03-09 05:30:00+00— only 23 hours of elapsed time. -
+ INTERVAL '24 hours'. Postgres advances 24 hours of wall-clock elapsed time. The UTC instant becomes2026-03-09 06:30:00+00. Rendered in New York that is02:30:00 EDT— 25 wall-clock-minutes later on the calendar than the "1 day" form. -
Why does it matter? A subscription that renews at "the same wall-clock time tomorrow" should use
INTERVAL '1 day'. A token that expires "24 hours from issue" should useINTERVAL '24 hours'. They are genuinely different durations across DST. -
The wrong default. Storing a duration as
BIGINT secondsand addingsecondsalways advances 24h × N elapsed seconds. Useful for token expiry; wrong for "next renewal in the user's calendar."
Output.
| Expression | UTC instant | New York wall-clock |
|---|---|---|
| original | 2026-03-08 06:30:00+00 | 2026-03-08 01:30:00 EST |
| + INTERVAL '1 day' | 2026-03-09 05:30:00+00 | 2026-03-09 01:30:00 EDT |
| + INTERVAL '24 hours' | 2026-03-09 06:30:00+00 | 2026-03-09 02:30:00 EDT |
Rule of thumb. "Same time tomorrow in the user's calendar" → INTERVAL '1 day'. "Twenty-four hours from now" → INTERVAL '24 hours' or BIGINT seconds. Mixing the two across DST produces silent off-by-one-hour bugs.
SQL interview question on DATE vs TIMESTAMP vs TIMESTAMPTZ
A common probe in a senior schema-design round sounds like: "We have a subscriptions table. Should started_at be DATE, TIMESTAMP, or TIMESTAMPTZ? What if the product needs to send a renewal email on the anniversary?" It tests whether the candidate aligns the column type with the semantics the product cares about.
Solution Using TIMESTAMPTZ for the moment plus a generated DATE for the anniversary
CREATE TABLE subscriptions (
subscription_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id BIGINT NOT NULL,
started_at TIMESTAMPTZ NOT NULL,
user_tz TEXT NOT NULL,
-- Anniversary day in the user's home timezone, computed from started_at
anniversary DATE GENERATED ALWAYS AS
((started_at AT TIME ZONE user_tz)::date) STORED
);
INSERT INTO subscriptions (user_id, started_at, user_tz)
VALUES
(101, TIMESTAMP '2025-06-02 09:00:00' AT TIME ZONE 'America/New_York', 'America/New_York'),
(102, TIMESTAMP '2025-06-02 23:30:00' AT TIME ZONE 'Asia/Tokyo', 'Asia/Tokyo');
Step-by-step trace.
| Row | started_at (UTC) | user_tz | (started_at AT TIME ZONE user_tz)::date | anniversary |
|---|---|---|---|---|
| 1 | 2025-06-02 13:00:00+00 | America/New_York | 2025-06-02 09:00:00 | 2025-06-02 |
| 2 | 2025-06-02 14:30:00+00 | Asia/Tokyo | 2025-06-02 23:30:00 | 2025-06-02 |
Output:
| subscription_id | user_id | started_at | user_tz | anniversary |
|---|---|---|---|---|
| 1 | 101 | 2025-06-02 13:00:00+00 | America/New_York | 2025-06-02 |
| 2 | 102 | 2025-06-02 14:30:00+00 | Asia/Tokyo | 2025-06-02 |
Why this works — concept by concept:
- TIMESTAMPTZ for the moment — the UTC instant is unambiguous; it survives every server move, every replica, every region.
- user_tz as a sibling column — the database keeps both "when did it happen?" (UTC) and "in what zone did the user mean it?" — the only durable representation of intent across DST shifts and migrations.
-
Generated DATE column —
GENERATED ALWAYS AS … STOREDcomputes the anniversary once per row, indexes cleanly, and stays consistent withstarted_atautomatically. The renewal job filters onWHERE anniversary = current_dateand ships emails on the right day for every user. - No app-side timezone math — the conversion logic lives in the database, where it can be reasoned about and tested. Application code reads the already-shaped column.
- Cost — generated columns add ~4 B per row; the saving in cross-team correctness debt is enormous.
SQL
Topic — date functions
Date and time problems (SQL)
5. Semi-structured types — JSON, JSONB, ARRAY, STRUCT
JSONB, ARRAY, and STRUCT are the bridge between rigid relational columns and "we don't know the shape yet" data
The mental model in one line: semi-structured types let one column hold a nested document, an ordered list, or a typed record — at the cost of slightly slower writes and a different indexing story. They are the right tool when the schema is genuinely evolving; they are the wrong tool when a field is accessed often, in which case it should be promoted to a top-level column.
JSON vs JSONB in Postgres.
-
JSONstores the value as text, parses on every read. Faithful to whitespace and key order; slow to query. -
JSONBstores the value as a binary tree, parsed once on write. Indexable with GIN, 5-10× faster on read; loses whitespace and may reorder keys. -
Production default:
JSONBunless you have a specific reason to preserve the exact text (audit trails, contract integrity, signature verification). -
MySQL has only
JSON, which is binary internally (closer to Postgres'JSONB). MariaDBJSONis text.
VARIANT in Snowflake.
- A single type that holds any JSON-like value. Snowflake automatically extracts column shards as it ingests, which lets the optimizer prune unused paths.
- Query with
column:keydot notation. ReturnsVARIANT; cast explicitly (::int,::string) before joining or aggregating.
ARRAY — homogeneous, ordered, indexable in some dialects.
-
Postgres — first-class
ARRAYfor any type, e.g.INT[],TEXT[],JSONB[]. Supportsunnest(arr)to explode into rows,arr @> ARRAY['x']for membership, GIN indexing for fast containment queries. -
BigQuery —
ARRAY<T>is first-class;UNNEST(arr)is the canonical explode operator. -
Snowflake —
ARRAYis an alias forVARIANTholding an ordered list;LATERAL FLATTEN(input => arr)to explode. -
Spark SQL —
ArrayType(T), explode withexplode(arr)orLATERAL VIEW EXPLODE(arr) t AS elem.
STRUCT / RECORD / MAP — typed nested records.
-
BigQuery —
STRUCT<name STRING, age INT64>for nested typed records; dot-accesst.user.name. Pairs well withARRAY<STRUCT<…>>. -
Snowflake —
OBJECTis the analogous container;OBJECT_CONSTRUCT(...)to build. -
Spark SQL —
StructType(StructField('name', StringType()), …); dot-accessdf.user.name. -
Postgres — composite types via
CREATE TYPE; less common in practice thanJSONB.
UNNEST patterns — the canonical explode operator.
-
SELECT unnest(tags) FROM posts(Postgres) → one row per array element. -
SELECT tag FROM posts, UNNEST(tags) AS tag(BigQuery) → one row per (post, tag) pair. -
SELECT t.value FROM posts, LATERAL FLATTEN(input => posts.tags) t(Snowflake) → same idea, slightly different syntax.
Indexing semi-structured data.
-
Postgres GIN on JSONB — supports
@>containment,?key existence,?|/?&multi-key existence. Beats every B-tree index on JSON access patterns. -
Postgres GIN on ARRAY — accelerates
@>and&&membership queries on text or integer arrays. - BigQuery JSON dot-notation — implicit indexing via column-pruning at the storage layer; no explicit index.
- Snowflake VARIANT — automatic per-path micro-clustering inside the column; no explicit index.
When to use semi-structured vs proper relational columns.
- Use semi-structured when the schema genuinely evolves (per-feature settings, third-party webhook payloads, telemetry where new fields arrive continuously) — and most fields are written and ignored.
-
Promote to columns when a field is accessed often. The
schema-on-readinstinct ("just dump it all intoJSONB") is a trap — every access pays a parse cost, and the optimizer cannot use column-level stats. -
Rule of thumb: if a key in
JSONBis read more than 5% of the time, promote it to a top-level column with aNOT NULLconstraint or an index.
Worked example — store and query a flexible product-attributes payload
Detailed explanation. A product catalog has stable columns (sku, name, price) plus a long-tail of category-specific attributes (size for clothing, watts for appliances, megapixels for cameras). Modeling every possible attribute as a column produces a sparse, ever-growing schema. The right tool is JSONB with a GIN index — the stable columns stay relational; the variable attributes live in one indexable document column.
Question. Design a products table that mixes stable columns with a flexible attribute document. Then write three queries: find all clothing in size M, find any product where the brand is Nike, and count products per brand.
Input (sample documents).
| sku | name | attributes (JSONB) |
|---|---|---|
| SKU001 | T-Shirt | {"category":"clothing","size":"M","brand":"Acme","color":"blue"} |
| SKU002 | Running shoes | {"category":"footwear","size":"42","brand":"Nike","color":"black"} |
| SKU003 | Microwave | {"category":"appliance","watts":900,"brand":"PanaPro","color":"white"} |
| SKU004 | T-Shirt | {"category":"clothing","size":"M","brand":"Nike","color":"red"} |
Code (Postgres).
CREATE TABLE products (
product_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
sku CHAR(8) NOT NULL UNIQUE,
name TEXT NOT NULL,
price_cents BIGINT NOT NULL CHECK (price_cents >= 0),
attributes JSONB NOT NULL
);
CREATE INDEX products_attrs_gin_idx
ON products USING GIN (attributes jsonb_path_ops);
-- All clothing in size M
SELECT sku, name
FROM products
WHERE attributes @> '{"category":"clothing","size":"M"}';
-- Any product where the brand is Nike (key access)
SELECT sku, name, attributes->>'brand' AS brand
FROM products
WHERE attributes->>'brand' = 'Nike';
-- Count products per brand
SELECT attributes->>'brand' AS brand, COUNT(*) AS n
FROM products
GROUP BY attributes->>'brand'
ORDER BY n DESC;
Step-by-step explanation.
-
JSONB storage. Each
attributesvalue is parsed on write into Postgres' binary JSON representation. Whitespace is dropped; keys may be reordered; the result fits in roughly the same disk space as the text payload. -
GIN index on
jsonb_path_ops. Indexes every key/value path inside the document.attributes @> '{"key":"value"}'becomes an index probe rather than a sequential scan; query cost drops from O(N) to O(log N). -
Containment query (
@>).attributes @> '{"category":"clothing","size":"M"}'matches every row whose attributes document contains both("category","clothing")and("size","M"). Returns SKU001 and SKU004. -
Key access (
->>).attributes->>'brand'returns the brand astext.WHERE attributes->>'brand' = 'Nike'uses an expression index if one exists; otherwise falls back to a sequential scan. For frequent access, consider an expression indexCREATE INDEX … ON products ((attributes->>'brand')). -
Group by JSON key. Same
->>operator. Postgres groups by the extracted text. Returns one row per distinct brand with counts.
Output (query 1 — clothing size M).
| sku | name |
|---|---|
| SKU001 | T-Shirt |
| SKU004 | T-Shirt |
Output (query 2 — brand Nike).
| sku | name | brand |
|---|---|---|
| SKU002 | Running shoes | Nike |
| SKU004 | T-Shirt | Nike |
Output (query 3 — count per brand).
| brand | n |
|---|---|
| Nike | 2 |
| Acme | 1 |
| PanaPro | 1 |
Rule of thumb. When attribute frequency exceeds ~5% across the dataset, promote that key to a top-level column with a real index. When it stays sparse, leave it in JSONB and let GIN do the work.
Worked example — STRUCT vs JSON in BigQuery for a typed event payload
Detailed explanation. A click event has a stable shape: user_id, event_ts, plus a nested device block with os, browser, and screen_width. In BigQuery, two natural choices are a STRUCT<> column (typed nested record) or a JSON column (untyped). The STRUCT form is typed at write time, queryable via dot notation, and compatible with BigQuery's columnar scan; the JSON form is more flexible but forces parsing on every read.
Question. Compare a BigQuery STRUCT<> event table to a JSON event table for a clickstream payload. Show the schema, an INSERT, and the two equivalent queries.
Input (one sample event).
| Field | Value |
|---|---|
| user_id | 42 |
| event_ts | 2026-06-02 12:00:00 UTC |
| device.os | iOS |
| device.browser | Safari |
| device.screen_width | 390 |
Code (BigQuery, illustrative).
-- Schema A — typed STRUCT
CREATE TABLE events_struct (
user_id INT64,
event_ts TIMESTAMP,
device STRUCT<os STRING, browser STRING, screen_width INT64>
);
INSERT events_struct VALUES
(42, TIMESTAMP '2026-06-02 12:00:00 UTC',
STRUCT('iOS' AS os, 'Safari' AS browser, 390 AS screen_width));
-- Query A — dot access, columnar scan
SELECT user_id, device.os, COUNT(*) AS n
FROM events_struct
GROUP BY user_id, device.os;
-- Schema B — JSON
CREATE TABLE events_json (
user_id INT64,
event_ts TIMESTAMP,
device JSON
);
INSERT events_json VALUES
(42, TIMESTAMP '2026-06-02 12:00:00 UTC',
JSON '{"os":"iOS","browser":"Safari","screen_width":390}');
-- Query B — JSON extraction
SELECT user_id, JSON_VALUE(device, '$.os') AS os, COUNT(*) AS n
FROM events_json
GROUP BY user_id, os;
Step-by-step explanation.
-
STRUCT schema. BigQuery stores
device.os,device.browser,device.screen_widthas separate columns under the hood — each is statistically tracked, columnar-scanned, and individually compressible. - JSON schema. The whole payload sits in one cell as a binary JSON value. Reads pay a parse cost per row; the optimizer cannot use per-key statistics.
-
Query cost (STRUCT). Reading
device.ostouches only thedevice.osshard. Bytes scanned drops to the size of that one nested column — typically 1-2 bytes per row after compression. -
Query cost (JSON). Reading
JSON_VALUE(device, '$.os')parses the wholedevicepayload for every row. Bytes scanned is the full JSON column — usually 5-10× the STRUCT cost. -
When to use which. Use
STRUCTwhen the shape is stable and known up front. UseJSONwhen the shape evolves continuously or when third-party payloads land with unpredictable keys. Promote frequent JSON keys toSTRUCTfields when query volume justifies it.
Output.
| Schema | Bytes scanned (per query) | Type safety | Schema evolution |
|---|---|---|---|
| STRUCT | small (only device.os column) |
strong (typed at write) | requires ALTER to add a key |
| JSON | large (whole device column) |
weak (parse on read) | free — any key any time |
Rule of thumb. STRUCT for stable, queryable nested shapes; JSON for genuinely evolving payloads. Promote stable keys out of JSON into typed columns as soon as they earn the query traffic.
SQL interview question on ARRAY columns vs join tables
A common probe in a modeling round sounds like: "When would you use a TEXT[] column for tags instead of a separate post_tags join table?" The interviewer is testing whether the candidate balances ergonomics (one row, one column) against the cost of indexing and query patterns.
Solution Using TEXT[] plus a GIN index for fast membership
CREATE TABLE posts (
post_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
title TEXT NOT NULL,
tags TEXT[] NOT NULL DEFAULT '{}'
);
CREATE INDEX posts_tags_gin_idx
ON posts USING GIN (tags);
INSERT INTO posts (title, tags) VALUES
('Intro to SQL data types', ARRAY['sql','data-types','postgres']),
('Stream processing 101', ARRAY['streaming','kafka','flink']),
('Schema design notes', ARRAY['sql','schema','data-types']);
-- All posts tagged 'sql'
SELECT post_id, title
FROM posts
WHERE tags @> ARRAY['sql'];
-- Explode tags into one row per (post, tag) pair
SELECT post_id, title, unnest(tags) AS tag
FROM posts;
Step-by-step trace.
| Step | Operator | What it does |
|---|---|---|
| 1 |
TEXT[] column |
stores all tags for one post in one row; no join table required for the simple case |
| 2 | GIN ON tags |
index every distinct tag value; @> queries are index-backed |
| 3 | tags @> ARRAY['sql'] |
"tags contains 'sql'" — returns posts 1 and 3 |
| 4 | unnest(tags) |
expands the array into one row per element; lets you join, aggregate, or window across tags |
| 5 | Trade-off | very fast for read-heavy "list / filter by tag"; less natural for "set the tag's display name" or "rename a tag everywhere" — those still need a tag dimension table |
Output (tags @> ARRAY['sql']).
| post_id | title |
|---|---|
| 1 | Intro to SQL data types |
| 3 | Schema design notes |
Why this works — concept by concept:
-
ARRAY column + GIN — the array is stored inline with the row; the GIN index serves all
@>and&&membership queries in O(log N). -
unnest — the canonical explode operator turns an array column into a vertical relation. Useful for
GROUP BY tagstyle queries without a separate join table. -
When a join table is still right — when each tag has its own metadata (display name, hex color, parent-tag), promote
tagsto atagsdimension table and apost_tags(post_id, tag_id)join table. The array column is the read-optimized shortcut; the join table is the canonical representation. -
JSONB alternative — same idea, more flexibility; use when the tag entries are objects (
{"name":"sql","weight":0.8}) rather than scalars. -
Cost — write = O(K) per row (K tags); read = O(log N) on membership; storage =
~25 B per row + tag bytes.
SQL
Topic — JSON
JSON column problems (SQL)
Choosing the right type by domain (cheat sheet)
-
Money →
NUMERIC(19, 4)(orBIGINTcents when sub-cent precision isn't needed). NeverFLOATorDOUBLE. The0.1 + 0.2 = 0.30000000000000004demo is your reference for why. -
Surrogate primary keys →
BIGINT GENERATED ALWAYS AS IDENTITY. For distributed systems where multiple writers mint IDs,UUID(16 bytes;gen_random_uuid()in Postgres). -
Foreign keys → match the parent column exactly. Mixing
INTparent +BIGINTchild is a common cause of slow joins. -
Timestamps →
TIMESTAMPTZfor every event-time.DATEfor calendar-day attributes.BIGINTepoch only when the storage layer doesn't have a real time-zone type. -
Short codes (currency, country, US state) →
CHAR(N)whereNis the exact width. -
Names and free-form text →
TEXTin Postgres;VARCHAR(255)(or sized for the longest realistic value) in MySQL. -
Booleans → native
BOOLEANeverywhere except SQL Server (BIT). In MySQL, declareBOOLEAN; it maps toTINYINT(1)but reads cleanly across tools. -
Tags / multi-value attributes →
TEXT[](Postgres) orARRAY<STRING>(BigQuery / Snowflake). Avoid the "comma-separated string" anti-pattern. -
Sparse / evolving attributes →
JSONB(Postgres) /JSON(MySQL) /VARIANT(Snowflake) /JSON(BigQuery). Add a GIN index in Postgres when access is frequent. -
Typed nested records →
STRUCTin BigQuery / Snowflake / Spark; composite types orJSONBin Postgres. -
Large binary blobs →
BYTEA(Postgres),BLOB(MySQL),VARBINARY(MAX)(SQL Server). For anything over a few MB, store in object storage and keep only a pointer in the database.
Frequently asked questions
Is VARCHAR(255) faster than VARCHAR(1000) in Postgres?
No. In Postgres, VARCHAR(N) and TEXT share the same storage path; the declared length is only a constraint enforced on INSERT and UPDATE. Smaller values use less disk regardless of the declared maximum. The "VARCHAR(255) is faster" rule of thumb originates from MySQL, where the declared length contributes to the row-width budget (capped at 65,535 bytes); shorter VARCHARs in MySQL leave room for more columns. In Postgres, prefer TEXT for unbounded strings and VARCHAR(N) only when N encodes a real domain constraint.
Should I use NUMERIC or FLOAT for currency?
Always NUMERIC (or DECIMAL, which is the same type in most dialects). FLOAT and DOUBLE PRECISION are IEEE-754, which cannot exactly represent decimal fractions like 0.1. Sums drift, audits fail, and the bugs surface only after enough volume. Use NUMERIC(19, 4) as a sensible default for retail money (15 digits before the decimal, 4 after). For very high precision (high-frequency trading), push the scale to NUMERIC(38, 8) or store integer-millicents in a BIGINT and format on display. The slightly slower comparisons are a small price for exact arithmetic.
TIMESTAMP vs TIMESTAMPTZ — which should I use?
TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) for almost every event-time column in production. Postgres stores it internally as UTC and renders it in the session time zone on read, which means the same row reads consistently across regions and time-zone-aware joins behave correctly. TIMESTAMP (without zone) is appropriate only when the value is genuinely zoneless — recurring schedules expressed in local hours, business-day cutoffs, working hours — where the meaning is "9am wherever the user is." Note that MySQL's TIMESTAMP already behaves like a TZ-aware column (it converts to UTC on write), while its DATETIME is the literal zoneless type — exactly opposite to Postgres' naming.
When should I use JSONB instead of normalised tables?
Reach for JSONB (or its dialect equivalent — VARIANT in Snowflake, JSON in BigQuery) when the schema is genuinely evolving and most fields are written-and-ignored: per-feature settings, third-party webhook payloads, telemetry where new keys arrive continuously. Stick to normalised columns when fields are accessed often, have referential integrity (foreign keys), or need column-level statistics for the planner. A useful rule of thumb is: if a key inside the document is read in more than ~5% of queries, promote it to a top-level column with an explicit type and a real index. JSONB is a feature, not the default — used everywhere it turns into a slow, opaque schema-on-read pile.
INT vs BIGINT for primary keys — does it really matter?
For surrogate primary keys, default to BIGINT. A signed 32-bit INT tops out at ~2.1 billion, which is a single quarter of traffic on a busy event table. Migrating a primary key from INT to BIGINT on a billion-row production table is an offline operation in most databases — far more expensive than the extra 4 bytes per row. For small lookups (a list of currencies, US states, weekday names), INT or even SMALLINT is fine; for any table that might grow without bound, start at BIGINT. The 8-byte width also matches the platform's native integer register on every modern CPU, so the planner's comparisons stay cheap.
How do data types affect query plans?
Three ways. First, the planner uses column statistics — distinct values, histogram, MCVs — to estimate predicates, and those stats are typed: a TEXT column's stats are different from a JSONB key's stats. Mistyped columns get rougher estimates and worse plans. Second, operator dispatch is type-driven — WHERE id = '42' against an INT column triggers an implicit cast that can disable an index in some dialects. Third, fixed-width types live at predictable offsets in the row, which lets the engine skip directly to a column during a sequential scan; variable-width types require length-prefix arithmetic. The net: matching column types to the domain produces tighter plans, faster scans, and fewer surprises in EXPLAIN.
Practice on PipeCode
- Drill the SQL practice library → for end-to-end type-driven schema, joins, and aggregation problems.
- Sharpen type-casting problems → for
CAST,::, and implicit coercion drills. - Practice date-functions problems → for
TIMESTAMPTZ,AT TIME ZONE,EXTRACT, and DST traps. - Work through JSON problems → when the interviewer wants
JSONBoperators and GIN access patterns. - For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- For a dialect-deep refresher, see SQL data types in PostgreSQL →.
- For end-to-end schema design, work through data modelling for DE interviews →.
- For the SQL muscle that pairs with type design, take SQL for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every type decision above ships with hands-on practice rooms where you write real `NUMERIC`, `JSONB`, and `TIMESTAMPTZ` queries against realistic schemas. Start with the SQL library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)