Gowtham Potureddi

Posted on Jun 2

SQL Data Types Deep Dive: INT, NUMERIC, VARCHAR, JSON, ARRAY, TIMESTAMP

#python #sql #interview #dataengineering

sql data types decide three things on every column you ever ship — how many bytes the row costs on disk, how fast the planner can compare values, and whether arithmetic returns the answer the business expects. Every senior data engineering review reaches for the type catalogue because the wrong choice at table-design time becomes a silent bug under load — INT overflow on a fast-growing event table, FLOAT rounding on a payment ledger, TIMESTAMP without time zone drifting across regions.

This guide is a working reference for the types of sql data types that show up in every schema: integers and decimals, fixed and variable-length strings, calendar and clock columns, and the semi-structured JSON / ARRAY / STRUCT families that modern warehouses lean on. Across five primitives — numeric, string, date-time, semi-structured, plus a domain-by-domain cheat sheet — we cover what each sql types family stores, when to reach for it, and the data datatype in sql pitfalls that catch experienced engineers. Each section pairs a teaching block with a worked example showing the code, a step-by-step trace, an output table, and a concept-by-concept why-this-works breakdown.

When you want hands-on reps immediately after reading, browse SQL practice library →, drill type-casting problems →, and rehearse on date-functions problems →.

On this page

Why data types decide query speed and storage cost
Numeric types — INT, NUMERIC, FLOAT, DOUBLE
String types — CHAR, VARCHAR, TEXT, encoding
Date and time types — DATE, TIME, TIMESTAMP, TIMESTAMPTZ, INTERVAL
Semi-structured types — JSON, JSONB, ARRAY, STRUCT
Cheat sheet — pick the right type by domain
Frequently asked questions
Practice on PipeCode

1. Why data types decide query speed and storage cost

Types are the only contract the database, the planner, and the application all share

The one-sentence invariant: a column's data type tells the storage engine how many bytes to reserve per row, tells the planner which comparison operator to use, and tells every downstream application what shape the value will take when it crosses the wire. Once you internalise that mental model — bytes first, semantics second — every sql data types decision collapses into a three-axis trade-off between storage, performance, and correctness.

Why "any type that fits" is the wrong instinct.

Storage. A BIGINT is 8 bytes, an INT is 4. On a billion-row event table that is 4 GB of difference, plus the same again for every index that touches the column.
Performance. Comparing two NUMERIC(38,9) values is materially slower than comparing two BIGINT values because the database must walk a variable-length representation. Multiply that by every join key and every WHERE predicate.
Correctness. FLOAT cannot exactly represent 0.1 — the canonical demo is 0.1 + 0.2 = 0.30000000000000004. Storing money in FLOAT is a guaranteed audit-failure waiting for enough volume.

The three levers types control.

Storage layout. Fixed-width types (INT, BIGINT, DATE, TIMESTAMP) sit at predictable offsets in the row — fast row scans, easy vectorisation. Variable-width types (VARCHAR, TEXT, NUMERIC, JSONB) include a length prefix or out-of-row pointer.
Operator dispatch. Each type ships with its own equality, comparison, hash, and arithmetic operators. A TEXT = INT comparison either fails with a type error (strict dialects like Postgres) or quietly coerces one side to the other (MySQL, SQLite) — the latter is where silent bugs live.
Index eligibility. Some indexes only support fixed-width types; others (B-tree, GIN, GiST) handle variable-width but with different cost. Choosing a type implicitly picks which indexes are useful on the column.

Type-mismatch bugs that look like data bugs.

INT overflow. A signed 32-bit INT tops out at ~2.1 billion. A fast-moving event sequence will hit that ceiling and the next insert either errors out (Postgres) or wraps to a negative number (MySQL with strict-mode off).
VARCHAR(N) truncation. MySQL with sql_mode not including STRICT_TRANS_TABLES will silently truncate a 30-character string into a VARCHAR(20) column. The application sees no error; the report sees mangled data.
FLOAT == FLOAT traps. WHERE amount = 0.30 will fail to match a row whose FLOAT amount is actually 0.30000000000000004. The bug looks like "missing data" until somebody traces the binary representation.
Time zone drift. Storing a TIMESTAMP (no time zone) and presenting it in two different application time zones produces two different wall-clock answers for the same row — a guaranteed bug as soon as your team crosses a continent.

Dialect variations that bite at migration time.

BOOLEAN — Postgres has a native boolean. MySQL stores it as TINYINT(1). Snowflake has a real BOOLEAN. Always check before porting.
TEXT vs VARCHAR(MAX) — Postgres TEXT and VARCHAR(n) use the same storage; SQL Server distinguishes VARCHAR(MAX) from VARCHAR(8000); Oracle uses CLOB.
TIMESTAMP semantics — Postgres TIMESTAMP has no time zone; TIMESTAMPTZ stores UTC. MySQL TIMESTAMP quietly converts to the session time zone on read; DATETIME does not. Snowflake has TIMESTAMP_NTZ, TIMESTAMP_LTZ, TIMESTAMP_TZ — three distinct semantics.
JSON vs JSONB — Postgres has both; JSONB is the binary, indexable, almost-always-correct choice. MySQL has JSON (also binary). Snowflake uses VARIANT. BigQuery uses JSON plus first-class STRUCT and ARRAY.

What interviewers and senior reviewers listen for.

Do you reach for NUMERIC(p, s) the moment money is mentioned? — senior signal.
Do you say "always store UTC, present in app TZ" when time zones come up? — required answer.
Do you ask "what is the upper bound on this string?" before choosing VARCHAR(n) vs TEXT? — required answer.
Do you mention JSONB + a GIN index when "flexible schema" comes up in Postgres? — current-default signal.

Worked example — type-driven storage cost on a billion-row event table

Detailed explanation. A realistic schema-design pause point is: "we expect a billion rows in this event table — how much do my column choices cost?" The answer reveals exactly how much weight types carry in a production warehouse. Below we walk a six-column event table where each column is chosen twice — once as a defensible production type, once as a casual choice that looks similar at design time but balloons on disk.

Question. Estimate the on-disk size of a billion-row event table under two schemas: production_schema (engineer-grade types) and casual_schema (everything wide). Then explain why the gap matters even before indexes.

Input.

Column	production_schema	casual_schema
event_id	BIGINT (8 B)	NUMERIC(38) (~17 B)
user_id	INT (4 B)	BIGINT (8 B)
event_type	SMALLINT (2 B, lookup)	VARCHAR(64) (~24 B avg)
amount_cents	INT (4 B)	NUMERIC(19, 4) (~10 B)
event_ts	TIMESTAMPTZ (8 B)	TIMESTAMP (8 B)
payload	JSONB (~120 B avg)	JSON text (~180 B avg)

Code (Postgres).

-- production schema
CREATE TABLE events_prod (
    event_id     BIGINT      NOT NULL,
    user_id      INT         NOT NULL,
    event_type   SMALLINT    NOT NULL,
    amount_cents INT         NOT NULL,
    event_ts     TIMESTAMPTZ NOT NULL,
    payload      JSONB       NOT NULL,
    PRIMARY KEY (event_id)
);

-- casual schema
CREATE TABLE events_casual (
    event_id     NUMERIC(38) NOT NULL,
    user_id      BIGINT      NOT NULL,
    event_type   VARCHAR(64) NOT NULL,
    amount_cents NUMERIC(19, 4) NOT NULL,
    event_ts     TIMESTAMP   NOT NULL,
    payload      JSON        NOT NULL,
    PRIMARY KEY (event_id)
);

Step-by-step explanation.

Per-row width. Sum the per-column byte estimates above. production_schema ≈ 146 B per row; casual_schema ≈ 247 B per row.
Row overhead. Add ~24 B of Postgres tuple header per row. production_schema ≈ 170 B; casual_schema ≈ 271 B.
Billion-row total (table only). Multiply by 1,000,000,000. production_schema ≈ 170 GB; casual_schema ≈ 271 GB. A 100 GB difference per environment before indexes and replicas.
Index amplification. A primary-key index on a NUMERIC(38) column is ~2× the size of one on a BIGINT. Secondary indexes that touch the wide columns inherit the same penalty.
Cache density. A 24 KB Postgres page holds ~144 production-rows but only ~88 casual-rows. Every range scan touches ~1.6× more pages in the casual schema, which doubles cache pressure under read load.

Output.

Lever	production_schema	casual_schema	Ratio
Per-row width (data + header)	170 B	271 B	1.6×
1 B rows (table only)	~170 GB	~271 GB	1.6×
Primary-key index (rough)	~20 GB	~40 GB	2.0×
Rows per 24 KB page	~144	~88	0.6×
Pages scanned for `count(*)`	~7 M	~11 M	1.6×

Rule of thumb. At design time, treat type width like cost-per-row times row-count. A 100-byte mistake at a billion rows is 100 GB on disk, doubled across primary and replica clusters. Choose the narrowest type the domain allows — BIGINT over NUMERIC(38), INT over BIGINT if 2 B is enough, SMALLINT over INT for lookups.

SQL interview question on type selection at scale

A senior question in a schema-design round often sounds like: "We're modelling a payments event table that will hit a billion rows in the next 18 months. Walk me through how you'd pick the type for each column and what guardrails you'd add." It probes whether the candidate connects the types of sql data types to storage, planner cost, and operational pain in the same sentence.

Solution Using narrow exact types plus explicit constraints

CREATE TABLE payments (
    payment_id    BIGINT        GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id       BIGINT        NOT NULL,
    merchant_id   INT           NOT NULL,
    amount_cents  BIGINT        NOT NULL CHECK (amount_cents >= 0),
    currency      CHAR(3)       NOT NULL,                -- ISO-4217
    status        SMALLINT      NOT NULL,                -- FK to status lookup
    occurred_at   TIMESTAMPTZ   NOT NULL DEFAULT now(),
    metadata      JSONB         NOT NULL DEFAULT '{}'::jsonb
);

CREATE INDEX payments_user_occurred_idx
    ON payments (user_id, occurred_at DESC);

CREATE INDEX payments_metadata_gin_idx
    ON payments USING GIN (metadata);

Step-by-step trace.

Column	Choice	Why narrower won
payment_id	BIGINT identity	8 B; survives 9 × 10^18 rows; planner uses native compare
user_id	BIGINT	room for global user IDs; matches the auth service width
merchant_id	INT	4 B is enough — merchant count fits in 2 B already
amount_cents	BIGINT	cents avoid `FLOAT`; `BIGINT` (8 B) covers $9.2 × 10^16
currency	CHAR(3)	fixed three-letter ISO code, zero waste
status	SMALLINT	lookup table; never more than a few hundred statuses
occurred_at	TIMESTAMPTZ	UTC under the hood; renders in app TZ
metadata	JSONB	binary, indexable, evolving

Output:

Metric	Value	Note
Per-row width (data + header)	~92 B	tight; payload sits at ~50 B median
1 B rows (table only)	~92 GB	budget-friendly
Index footprint	~30 GB	PK + composite + GIN
Read pattern	user + time	composite index serves user dashboards
Failure mode	overflow / truncation	`BIGINT` and `CHAR(3)` both error cleanly

Why this works — concept by concept:

Narrow exact types — every column uses the smallest exact type the domain allows. BIGINT for IDs and money-as-cents, INT for foreign keys, SMALLINT for enums, CHAR(3) for ISO codes. No NUMERIC(38), no wide VARCHAR unless the domain needs it.
Cents over FLOAT — storing money as integer cents (or NUMERIC(19, 4) if sub-cent precision matters) avoids IEEE-754 rounding. The application formats display values; the database stores facts.
TIMESTAMPTZ — Postgres stores UTC internally and renders in the session time zone. Same row, different time zones, consistent moment-in-time semantics.
JSONB — binary storage with key parsing on write, indexable via GIN. The right choice for evolving payloads where most fields are not yet promoted to columns.
Constraints — NOT NULL, CHECK (>=0), foreign-key references on merchant_id and status keep bad rows out before the application even sees them.
Cost — write latency = O(per-row fixed-width append); read latency = O(log N) on the composite index; storage = O(rows × 92 B + indexes).

Worked example — the implicit-cast trap that kills index usage

Detailed explanation. A subtle but expensive consequence of mismatched sql types is implicit casting at query time. When the application sends a string literal against an integer column, the planner has to coerce one side or the other. In many dialects (MySQL is the prototypical offender), the cast happens on the column — which means a B-tree index on that column is no longer usable. The query falls back to a sequential scan, and a 5 ms point lookup becomes a 5-second table sweep at scale.

Question. Given orders(user_id BIGINT, status SMALLINT) with a B-tree index on user_id, explain why WHERE user_id = '42' is fast in some dialects and slow in others, and write the fix that works everywhere.

Input.

Dialect	Predicate	Behaviour
Postgres	`WHERE user_id = '42'`	parses `'42'` as `text`; planner casts `'42'` to `BIGINT`; index used
MySQL	`WHERE user_id = '42'`	same predicate; planner casts column to `VARCHAR`; index not used
SQL Server	`WHERE user_id = '42'`	similar to MySQL — implicit cast can disable the index
Universal fix	`WHERE user_id = 42`	literal is already the right type; index used everywhere

Code.

-- Slow path on MySQL — implicit column cast disables the index
EXPLAIN SELECT * FROM orders WHERE user_id = '42';
-- → type: ALL, rows: 1_000_000   (full table scan)

-- Fast path everywhere
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- → type: ref,   rows: 8         (index lookup)

-- If the literal must be a string (e.g. from a parameterised ORM), cast it
EXPLAIN SELECT * FROM orders WHERE user_id = CAST('42' AS UNSIGNED);
-- → type: ref,   rows: 8         (index lookup again)

Step-by-step explanation.

Predicate type matching. The planner walks user_id = '42' and notices the types differ (BIGINT on the left, VARCHAR on the right). Some dialects coerce the literal; some coerce the column.
Column-side coercion is the killer. Wrapping the column in a cast (CAST(user_id AS VARCHAR) = '42') makes the B-tree index inapplicable — the index is on BIGINT values, not on their VARCHAR projection.
MySQL's particular behaviour. Historically, MySQL preferred to cast the integer column to a string when comparing against a string literal — which silently disabled the index. Modern MySQL is better, but the rule still bites in many setups.
The fix. Always send the literal in the column's type. ORMs typically do this if you pass 42 (int) rather than "42" (string). Parameterised drivers do it when the bind parameter is declared as the right SQL type.
Diagnostic move. When a query is mysteriously slow, run EXPLAIN. A type: ALL row count equal to the table size is the implicit-cast smell.

Output.

Query	Plan	Rows scanned
`WHERE user_id = '42'` (MySQL)	full table scan	~1,000,000
`WHERE user_id = 42` (every dialect)	index lookup	~8
`WHERE user_id = CAST('42' AS UNSIGNED)`	index lookup	~8

Rule of thumb. Strings bind to string columns; integers bind to integer columns. Never let your ORM or driver decide. When a query is slow, EXPLAIN first and check the predicate types second — implicit casts are the #1 reason a "well-indexed" query takes a second.

SQL
Topic — SQL
SQL schema-design and type-selection problems

Practice →

2. Numeric types — INT, NUMERIC, FLOAT, DOUBLE

Numeric columns split into two families — exact integers and approximate floats — and a third (NUMERIC) that is "exact, slow, and unbounded"

The mental model in one line: integers and NUMERIC are exact, FLOAT and DOUBLE are approximate — the database does not let you mix them without telling you what was rounded. Once you know which family a column belongs to, every sql data types decision in the numeric column family collapses to two questions: "what's the range?" and "do I need exact arithmetic?"

The INT family in five bullets.

SMALLINT — 2 bytes, range -32,768 to 32,767. Lookups, small enums, weekday flags. Underused in modern schemas; reach for it when the domain genuinely fits.
INT (a.k.a. INTEGER) — 4 bytes, range -2,147,483,648 to 2,147,483,647. The default for counts, foreign keys, anything under ~2.1 B. Falls over silently if your sequence outgrows it.
BIGINT — 8 bytes, range ~±9.2 × 10^18. The default for surrogate primary keys in any system that might ever exceed 2 B rows. Always use BIGINT for event tables.
SERIAL / BIGSERIAL (Postgres legacy) → GENERATED ALWAYS AS IDENTITY is the modern equivalent; safer, follows the SQL standard.
Auto-promotion is not free. Migrating an INT primary key to BIGINT on a billion-row table is an offline operation in most production databases. Choose BIGINT up front for surrogate keys.

NUMERIC(p, s) and DECIMAL — exact arithmetic.

NUMERIC(p, s) stores up to p total digits with s digits after the decimal point. NUMERIC(19, 4) covers all money values you will ever encounter in a single ledger.
Arithmetic is exact — 0.1 + 0.2 = 0.3 exactly. No IEEE-754 rounding.
Storage is variable — Postgres uses ~2 bytes per 4 decimal digits plus overhead. A NUMERIC(19, 4) averages ~10 bytes.
Performance is slower than fixed-width integers — the engine cannot vectorise variable-length comparisons as cheaply.
DECIMAL is the SQL-standard synonym for NUMERIC in most dialects. They are interchangeable.

FLOAT, DOUBLE PRECISION, REAL — approximate arithmetic.

IEEE-754 single (REAL / FLOAT(24), 4 bytes) and double (DOUBLE PRECISION / FLOAT(53), 8 bytes) precision.
Never store money in FLOAT or DOUBLE. 0.1 + 0.2 ≠ 0.3 at the binary level; aggregate sums drift; audits fail.
Good for sensor telemetry, scientific data, analytics where a known relative tolerance is acceptable.
Equality on FLOAT is essentially never the right operator. Use abs(a - b) < epsilon instead.

BOOLEAN — small, special, dialect-dependent.

Postgres: native 1-byte boolean with true / false / null semantics.
MySQL: there is no real BOOLEAN. BOOL and BOOLEAN are aliases for TINYINT(1) — 0 or 1.
Snowflake: native BOOLEAN.
SQL Server: no BOOLEAN; use BIT.
BigQuery: native BOOL.
Portability tip: when modelling for cross-dialect deployment, use a BOOLEAN-named column and let the migration tool map it; don't hard-code TINYINT(1).

Choosing between BIGINT and NUMERIC for counters.

Counter that increments by 1 — BIGINT. 8 bytes, fast compare, no rounding ever.
Counter that accumulates fractional amounts — NUMERIC(p, s). Decide p based on the lifetime sum, s based on the smallest unit you have to track.
Counter that holds aggregate sensor totals — BIGINT if you can keep the unit small enough (microseconds, nanograms); NUMERIC if you need arbitrary fractional precision.

Worked example — pick the right numeric type for a payments ledger

Detailed explanation. A payments ledger is the canonical exact-arithmetic surface. The natural columns are amount (the transaction value), fee_amount (a derived value), and running_balance (a per-account aggregate). The right type for each is NUMERIC(19, 4) — exact, big enough for global ledgers, small enough to index efficiently. The wrong type for any of them is FLOAT, and the demo below shows exactly why.

Question. Two engineers propose competing types for a payments table — Engineer A uses NUMERIC(19, 4), Engineer B uses DOUBLE PRECISION. After 100 microtransactions of $0.10 each, what does each schema report for SUM(amount), and which one matches the auditor's expectation of $10.00?

Input.

Engineer	Column	Type	Inserted value
A	amount	NUMERIC(19, 4)	0.1000 × 100 rows
B	amount	DOUBLE PRECISION	0.1 × 100 rows

Code (Postgres).

-- exact ledger
CREATE TABLE ledger_a (id BIGINT PRIMARY KEY, amount NUMERIC(19, 4));
INSERT INTO ledger_a SELECT g, 0.10 FROM generate_series(1, 100) g;

-- float ledger
CREATE TABLE ledger_b (id BIGINT PRIMARY KEY, amount DOUBLE PRECISION);
INSERT INTO ledger_b SELECT g, 0.10 FROM generate_series(1, 100) g;

SELECT 'A' AS who, SUM(amount) AS total FROM ledger_a
UNION ALL
SELECT 'B', SUM(amount) FROM ledger_b;

Step-by-step explanation.

Insert. Each row inserts 0.10. Ledger A stores the exact decimal 0.1000; ledger B stores the closest DOUBLE to 0.1, which is 0.1000000000000000055511151231257827021181583404541015625.
Sum. Ledger A's SUM adds 100 exact 0.1000 values → 10.0000. Ledger B's SUM adds 100 approximate values; the rounding errors accumulate.
Compare. Ledger A reports 10.0000. Ledger B reports something like 10.000000000000002 — visibly wrong to the auditor.
Why does this matter? The application probably formats both as $10.00 on screen. But the underlying value in ledger B is not $10.00 — it is "as close to $10.00 as DOUBLE can express." Any equality check (WHERE total = 10.00) will fail; any aggregation across many such sums will drift further.

Output.

Ledger	Column type	SUM(amount)	Audit pass?
A	NUMERIC(19, 4)	10.0000	✓
B	DOUBLE PRECISION	~10.000000000000002	✗

Rule of thumb. Money is NUMERIC(p, s). The p and s are set by the largest expected aggregate and the smallest unit you track. For most consumer applications NUMERIC(19, 4) is the right answer (15 digits before the decimal, 4 after). For high-frequency trading, push s to 8.

Worked example — SMALLINT for a status enum saves real bytes at scale

Detailed explanation. A status column on a high-volume table — order status, event status, request status — is often modelled as a VARCHAR(20) containing values like pending, confirmed, shipped. The right type, in almost every dialect, is SMALLINT plus a lookup table. The string form costs ~24 bytes per row for the average 8-character status (4 B length-prefix + content + alignment); the SMALLINT form costs 2 bytes. On a billion-row table that is 22 GB saved before any index touches the column.

Question. Compare three ways to model an order status: VARCHAR(20), CHAR(10), and SMALLINT with a status_lookup table. Show the storage and the query cost for "give me the order count by status."

Input (the three schemas).

Schema	Column	Type	Bytes per row
A	status	VARCHAR(20)	~12 B (avg "shipped" = 7 chars + header)
B	status	CHAR(10)	10 B (always; padded)
C	status_id	SMALLINT	2 B

Code.

-- Schema C — lookup table + SMALLINT FK
CREATE TABLE order_status_lookup (
    status_id   SMALLINT PRIMARY KEY,
    status_name TEXT NOT NULL UNIQUE
);

INSERT INTO order_status_lookup VALUES
    (1, 'pending'), (2, 'confirmed'), (3, 'shipped'),
    (4, 'delivered'), (5, 'cancelled');

CREATE TABLE orders_c (
    order_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    status_id SMALLINT NOT NULL REFERENCES order_status_lookup(status_id),
    placed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Count by status (Schema C)
SELECT s.status_name, COUNT(*) AS n
FROM orders_c o
JOIN order_status_lookup s USING (status_id)
GROUP BY s.status_name
ORDER BY n DESC;

Step-by-step explanation.

Lookup table. Five rows; ~50 bytes total. Lives in the buffer cache forever; the join is effectively free.
SMALLINT FK. Each orders_c row stores a 2-byte integer instead of an 8-12 byte string. On a billion rows that is 6-10 GB saved on the table alone.
Index footprint. A B-tree index on status_id is also 2 bytes per entry plus overhead. A B-tree on VARCHAR(20) is 8-12 bytes per entry plus the same overhead. Pull-through savings are 3-6×.
Query shape. The JOIN to the lookup table is a hash join with five entries on the build side — it does not show up in EXPLAIN ANALYZE time profiles.
Display friendliness. Reports and dashboards see the textual status because of the join. The wire format stays human-readable; the storage stays compact.

Output (billion-row estimate, table only).

Schema	Bytes per row	Total (1 B rows)	Index (rough)
A — VARCHAR(20)	~12	~12 GB	~14 GB
B — CHAR(10)	10	~10 GB	~12 GB
C — SMALLINT + lookup	2 + lookup ~50 B	~2 GB	~3 GB

Rule of thumb. Whenever a column has a closed, small set of values (≤ ~32k), reach for SMALLINT plus a lookup table. The storage is 4-6× tighter, the index is just as small, and the lookup table makes the enum self-documenting in the database itself.

SQL interview question on choosing the integer width

A common probe in a schema-review round sounds like: "We have a fast-growing event table — how do you pick between INT and BIGINT for the primary key?" It tests whether the candidate connects the column type to the lifetime of the table, not just the row count today.

Solution Using BIGINT identity with explicit overflow guard

CREATE TABLE events (
    event_id  BIGINT      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id   BIGINT      NOT NULL,
    occurred  TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload   JSONB       NOT NULL
);

-- Cheap overflow probe — runs in O(1)
SELECT
    pg_sequence_last_value('events_event_id_seq') AS current_id,
    9223372036854775807                            AS bigint_max,
    9223372036854775807
      - pg_sequence_last_value('events_event_id_seq') AS headroom;

Step-by-step trace.

Step	Action	Effect
1	`BIGINT GENERATED ALWAYS AS IDENTITY`	8-byte width; covers `9.2 × 10^18` rows; modern SQL-standard syntax
2	Each insert	sequence advances by 1; row gets the next ID, atomic across sessions
3	At 2 B rows	an `INT` schema would have already overflowed; `BIGINT` reports `2,000,000,000` with `~9.2 × 10^18` of headroom
4	`pg_sequence_last_value`	constant-time check; safe to wire into a monitoring dashboard
5	At 9 × 10^18 rows	still healthy; in practice you will retire the table long before this

Output:

Column	Type	Width	Range	Safe-for
event_id	BIGINT	8 B	~±9.2 × 10^18	event tables, audit logs
user_id	BIGINT	8 B	global ID space	matches auth service
occurred	TIMESTAMPTZ	8 B	year 4713 BC to 294276 AD	every transaction
payload	JSONB	variable	binary JSON	indexable via GIN

Why this works — concept by concept:

BIGINT for surrogate keys — 8 bytes is the cheapest type that survives the lifetime of any high-throughput table. Migrating from INT to BIGINT after the fact is an offline operation in most production databases.
GENERATED ALWAYS AS IDENTITY — the SQL-standard replacement for SERIAL. Locked-down by default (cannot be overridden by INSERT … VALUES (id, …)), which prevents accidental insertions that skip the sequence.
Headroom monitoring — pg_sequence_last_value lets ops paint a single number on a dashboard: "current ID / max ID." Linear extrapolation gives a calendar date for the overflow event, decades out for BIGINT.
Width discipline — every fixed-width 8-byte column lines up cleanly in the row, which keeps the page format efficient. NUMERIC(38) would have varied per row and inflated index size by ~2×.
Cost — insert latency = O(1) sequence advance; storage = 8 B per row + index entry; planner uses native integer compare on every join.

SQL
Topic — type-casting
Type-casting problems (SQL)

Practice →

3. String types — CHAR, VARCHAR, TEXT, encoding

Strings split into fixed-width, length-bounded, and unbounded — pick by predictability, not by habit

The mental model in one line: CHAR(N) is fixed-width with space padding, VARCHAR(N) is variable with a hard upper bound, TEXT is variable with no upper bound — and in modern Postgres they all share the same storage path. Once you internalise that the sql types for strings differ in semantics (the length check), not in storage (the bytes on disk), the noisy "VARCHAR(255) is faster than TEXT" arguments collapse to dialect-specific footnotes.

CHAR(N) — fixed-width, space-padded.

Always stores exactly N characters; pads with trailing spaces if the value is shorter.
Useful when the domain is truly fixed — ISO currency codes (CHAR(3)), US state codes (CHAR(2)), fixed-width hashes (CHAR(64) for SHA-256 hex).
Pitfall: comparison ignores trailing spaces in some dialects ('US' = 'US ' is true in SQL Server CHAR), but inequality and LIKE may not. Always document the semantics.
Use sparingly. In 2026 schemas, CHAR is rare outside genuinely fixed-width codes.

VARCHAR(N) — variable, length-checked.

Stores up to N characters; uses one or two length-prefix bytes plus the actual data.
The length is a constraint, not a storage hint. Postgres reserves the same space for VARCHAR(255) and VARCHAR(1000) — both grow as the data needs. MySQL, in contrast, treats the declared length as part of the row width up to 65,535 bytes (the row-size limit).
VARCHAR(255) myth. In Postgres there is no performance gain from VARCHAR(255) vs VARCHAR(1000) vs TEXT. The myth comes from the MySQL row-width limit, where shorter VARCHARs leave room for more columns. In Postgres, TEXT is the safe default.
Choose N from the domain. If the maximum legal email length is 320, declare VARCHAR(320) — never VARCHAR(255) "because everyone does." The length constraint catches bad input at the database boundary.

TEXT — variable, unbounded.

No declared upper bound. Postgres stores small TEXT inline and large TEXT out-of-row in TOAST (The Oversized-Attribute Storage Technique).
Same storage path as VARCHAR in Postgres — TEXT is the idiomatic Postgres default for any string column.
SQL Server: TEXT is deprecated; use VARCHAR(MAX) or NVARCHAR(MAX).
Oracle: CLOB for character large objects; very different storage semantics.
MySQL: TEXT, MEDIUMTEXT, LONGTEXT for progressively larger limits (64 KB / 16 MB / 4 GB).

Encoding — UTF-8 is the default; the pitfalls are real.

UTF-8 is variable-byte. Each character takes 1 to 4 bytes. A VARCHAR(255) can hold 255 characters, not bytes — but the row-byte limit in MySQL still applies (so a utf8mb4 VARCHAR(255) reserves 255 × 4 = 1020 row bytes for size calculations).
utf8 ≠ utf8mb4 in MySQL. Historically utf8 in MySQL was a 3-byte subset that could not store emojis or many CJK characters. Always use utf8mb4 (the 4-byte variant) — utf8 is essentially a legacy alias.
Collation matters. utf8mb4_general_ci is case-insensitive and accent-insensitive; utf8mb4_bin is byte-wise. Sort order and WHERE name = 'café' semantics depend on the collation. Always pick deliberately.
Silent truncation. With MySQL sql_mode not including STRICT_TRANS_TABLES, inserting a string longer than the column will silently truncate. Postgres always errors. SQL Server errors unless ANSI_WARNINGS OFF.

TEXT vs VARCHAR(MAX) vs CLOB — dialect cheat sheet.

Postgres — TEXT is canonical. Use VARCHAR(N) only when N is genuinely a constraint.
MySQL — VARCHAR(N) for bounded strings (up to ~21,000 characters with utf8mb4 and a near-empty row), TEXT / MEDIUMTEXT / LONGTEXT for unbounded.
SQL Server — VARCHAR(N) up to 8000, VARCHAR(MAX) for larger. Use NVARCHAR if you need full Unicode (UTF-16).
Oracle — VARCHAR2(N) for ≤ 4000 bytes (or 32767 with MAX_STRING_SIZE=EXTENDED), CLOB for larger.
Snowflake — VARCHAR(N) and STRING are aliases; both are unlimited; N is enforced as a length constraint only.
BigQuery — STRING (unlimited) and BYTES; no VARCHAR.

Worked example — pick string types for a user-profile table

Detailed explanation. A user-profile table mixes truly fixed-width fields (country_code), bounded fields (email), and unbounded fields (bio). The right type per column is determined by the domain — not by tradition.

Question. Design a users table in Postgres that stores a 3-letter country code, an email address (max 320 chars per RFC 5321), a screen name (max 50 chars, enforced at the API), and a free-form bio of any length. Show the schema, then a row that exercises each column.

Input.

Column	Domain	Right type
country_code	ISO-3166 alpha-3 (always 3 letters)	CHAR(3)
email	RFC 5321 max 320	VARCHAR(320)
screen_name	product rule: ≤ 50 chars	VARCHAR(50)
bio	free-form, can hold emoji	TEXT

Code.

CREATE TABLE users (
    user_id      BIGINT      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    country_code CHAR(3)     NOT NULL,
    email        VARCHAR(320) NOT NULL UNIQUE,
    screen_name  VARCHAR(50)  NOT NULL,
    bio          TEXT         NOT NULL DEFAULT ''
);

INSERT INTO users (country_code, email, screen_name, bio)
VALUES
('USA', 'alice@example.com',  'alice',  'data engineer · ❤ sql'),
('IND', 'bob@example.com',    'bob42',  ''),
('GBR', 'carol@example.com',  'carol-c', 'I write about systems.');

Step-by-step explanation.

country_code CHAR(3) — fixed-width because every ISO-3166 alpha-3 code is exactly 3 letters. Storage is 3 bytes plus a tiny overhead. Reads always return exactly 3 characters; no padding surprises because the domain is already 3.
email VARCHAR(320) — bounded by the RFC. The database enforces the upper bound, which catches bad input at the boundary rather than in the application. UTF-8 friendly (most emails are ASCII; the column also handles internationalized addresses).
screen_name VARCHAR(50) — bounded by the product rule. If the rule changes, the migration is a single ALTER COLUMN; until then, no surprises in the schema.
bio TEXT — unbounded, supports emoji, handled identically to VARCHAR(50) under the hood in Postgres. Small values stay inline; very long values are TOASTed out-of-row.
Insert exercise — three rows demonstrate that bio can hold the emoji ❤ (UTF-8 multi-byte) without any column-width juggling.

Output.

user_id	country_code	email	screen_name	bio
1	USA	alice@example.com	alice	data engineer · ❤ sql
2	IND	bob@example.com	bob42
3	GBR	carol@example.com	carol-c	I write about systems.

Rule of thumb. Use CHAR(N) only when the domain is truly fixed-width. Use VARCHAR(N) to encode a real domain constraint that you want the database to enforce. Use TEXT for genuinely unbounded strings. Never use VARCHAR(255) as a default just because the textbook example used it.

Worked example — MySQL silent truncation vs Postgres strict failure

Detailed explanation. A team migrates a VARCHAR(20) column from MySQL to Postgres. The source data has a few rows where a typo extended the value to 30 characters. MySQL accepted them with silent truncation; Postgres refuses, raising value too long for type character varying(20). This is the canonical "dialect bites you" story for data datatype in sql sizing — and the right fix is rarely "make the column wider," it is "decide what the domain really is."

Question. Given a VARCHAR(20) column receiving a 30-character value, show what MySQL with sql_mode not including STRICT_TRANS_TABLES does, what Postgres does, and the right way to handle the cross-dialect migration.

Input.

Source	Column	Value being inserted	Length
MySQL (sql_mode lax)	name VARCHAR(20)	'Alessandra-Maria Rodriguez'	27
Postgres	name VARCHAR(20)	'Alessandra-Maria Rodriguez'	27

Code.

-- MySQL — silent truncation when sql_mode is lax
CREATE TABLE customers_my (id BIGINT PRIMARY KEY, name VARCHAR(20));
INSERT INTO customers_my VALUES (1, 'Alessandra-Maria Rodriguez');
SELECT id, name, length(name) FROM customers_my;
-- → (1, 'Alessandra-Maria Rod', 20)   ⚠️ silently truncated

-- Postgres — clean error
CREATE TABLE customers_pg (id BIGINT PRIMARY KEY, name VARCHAR(20));
INSERT INTO customers_pg VALUES (1, 'Alessandra-Maria Rodriguez');
-- → ERROR: value too long for type character varying(20)

Step-by-step explanation.

MySQL lax mode. Without STRICT_TRANS_TABLES, the server treats the excess characters as "a warning, not an error." The truncated value is committed. The application receives a successful INSERT and never sees the warning unless it polls SHOW WARNINGS.
Postgres strict-always. Postgres raises 22001 string_data_right_truncation and rolls back the transaction. The application sees the failure immediately.
Migration discovery. The migration tool pulls one row at a time and inserts into Postgres; the first oversized row fails the entire batch. The team now sees the data is wider than the schema claims.
The right fix. Two options. Either the domain genuinely needs more room (then ALTER COLUMN name TYPE VARCHAR(64) or move to TEXT), or the wider values were typos that should have been caught at the source (then clean the data and add an application-side check).
The wrong fix. "Just remove the length constraint" — turning every VARCHAR(20) into TEXT removes the database's ability to catch bad input at the boundary. The fix should be domain-driven, not error-driven.

Output.

Database	Insert behaviour	Stored value	App observation
MySQL (lax)	succeeds with warning	'Alessandra-Maria Rod'	silent data loss
MySQL (STRICT)	errors	(no insert)	clear failure
Postgres	errors	(no insert)	clear failure

Rule of thumb. Run MySQL with sql_mode including STRICT_TRANS_TABLES. Always. The data you don't catch at the boundary is the data that shows up in a Monday-morning incident.

SQL interview question on string-type pitfalls

A senior probe in a schema review often sounds like: "Why is VARCHAR(50) the wrong default for a column that holds product descriptions across English, Spanish, and Japanese?" The interviewer is checking whether the candidate knows that UTF-8 length is in characters (or bytes, depending on dialect), and that fifty characters in Japanese is more storage than fifty in English.

Solution Using `TEXT` plus an application-side validator and a CHECK constraint

CREATE TABLE products (
    product_id   BIGINT      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sku          CHAR(8)     NOT NULL UNIQUE,
    description  TEXT        NOT NULL,
    CHECK (char_length(description) BETWEEN 1 AND 500)
);

INSERT INTO products (sku, description) VALUES
('SKU00001', 'Bluetooth headphones with noise cancellation'),
('SKU00002', 'Auriculares Bluetooth con cancelación de ruido'),
('SKU00003', 'ノイズキャンセリング機能付き Bluetooth ヘッドフォン');

Step-by-step trace.

Row	Language	char_length(description)	byte_length	Stored in TEXT?
1	English	47	47	✓
2	Spanish	46	47 (one accented char ≈ 2 B)	✓
3	Japanese	30	70 (each kanji ≈ 3 B in UTF-8)	✓

Output:

product_id	sku	description
1	SKU00001	Bluetooth headphones with noise cancellation
2	SKU00002	Auriculares Bluetooth con cancelación de ruido
3	SKU00003	ノイズキャンセリング機能付き Bluetooth ヘッドフォン

Why this works — concept by concept:

TEXT in Postgres — same storage path as VARCHAR, no need to pick a magic number for the byte-width of the longest realistic value. The CHECK constraint enforces a character-count bound that is locale-aware.
char_length vs octet_length — char_length returns the number of code points; octet_length returns the number of bytes. Constraints should usually be expressed in characters so the same rule applies regardless of language.
CHAR(8) for SKUs — fixed-width because every SKU is exactly eight characters. Indexes on CHAR(N) are slightly more compact than on VARCHAR(N).
UTF-8 awareness — UTF-8 stores ASCII in 1 byte, Latin extended in 2, BMP in 3, supplementary plane (including most emoji) in 4. The schema never has to know — but you do, when sizing storage budgets.
Cost — storage = 9 B (header) + actual UTF-8 bytes per row; index size depends on the language mix in real data.

SQL
Topic — string functions
String manipulation problems (SQL)

Practice →

4. Date and time types — DATE, TIME, TIMESTAMP, TIMESTAMPTZ, INTERVAL

`data datatype in sql` for time is five distinct primitives — pick by what the column represents, not what the wire format looks like

The mental model in one line: a DATE is a calendar day, a TIME is a wall-clock, a TIMESTAMP is a calendar day plus a wall-clock with no zone, a TIMESTAMPTZ is a fully-specified moment in UTC, and an INTERVAL is a duration with no anchor. Once you can name what each one means, every "store local time? UTC? both?" debate collapses into a single rule: store UTC, present in app TZ.

DATE — a calendar day, no time.

4 bytes in Postgres; 3 in MySQL. Range from 4713-01-01 BC to 5874897-12-31 AD in Postgres.
The right type for birthdays, expiration dates, "the day the order was placed," and any value where the time of day is genuinely unknown or irrelevant.
No time-zone semantics — a DATE is the same calendar value everywhere on Earth.

TIME (TIME WITHOUT TIME ZONE) — a wall-clock, no date.

8 bytes in Postgres. Range 00:00:00 to 24:00:00.
Useful for "store opens at 09:00" — a daily schedule that does not anchor to a specific calendar day.
Rare in modern schemas; usually you want a TIMESTAMP for the actual event and a TIME only for the recurring pattern.

TIMESTAMP (TIMESTAMP WITHOUT TIME ZONE) — date + time, no zone.

8 bytes. Stores the value literally as written; no conversion on read or write.
Almost always the wrong choice for production data. Without a zone, the same column value (2026-06-02 09:00:00) could mean "9am in New York" or "9am in Tokyo" — the database cannot tell.
Useful only when the value is genuinely zoneless — for example, "this report covers the working hours 09:00-17:00 in whichever timezone the user is in."

TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) — the production default.

8 bytes. Postgres stores it as UTC internally; renders it in the session time zone on read.
The right type for "when did the event happen?" — a single global moment that can be presented in any time zone.
MySQL's TIMESTAMP ≠ Postgres' TIMESTAMP. MySQL's TIMESTAMP already converts to UTC on write and back to the session zone on read; MySQL's DATETIME is the literal "no zone" value. Mind the dialect.
Snowflake has TIMESTAMP_NTZ (no zone), TIMESTAMP_LTZ (local — same idea as Postgres' TZTZ), and TIMESTAMP_TZ (literal stored zone). TIMESTAMP_LTZ is the closest equivalent to Postgres' TIMESTAMPTZ for moment-in-time semantics.

INTERVAL — the underrated type (Postgres-only in this strong form).

Stores a duration: INTERVAL '7 days 3 hours', INTERVAL '90 seconds', INTERVAL '1 year'.
16 bytes in Postgres. Supports natural arithmetic: now() + INTERVAL '7 days', event_ts - other_event_ts.
INTERVAL '1 day' ≠ INTERVAL '24 hours' — across a DST transition, "1 day" is 23 or 25 hours of wall-clock time, while "24 hours" is always 24 elapsed hours.
Not portable. MySQL has INTERVAL only as a literal in expressions, not as a column type. For cross-dialect deployment, store durations as BIGINT seconds (or BIGINT milliseconds) and reconstruct.

Epoch / UNIX timestamp — when to use BIGINT instead.

A BIGINT UNIX-epoch second (or millisecond) value works in every dialect, never loses information across migrations, and indexes as a plain integer.
Trade-off: humans cannot read it. Always pair with a view or computed column that exposes to_timestamp(epoch_seconds).
Use BIGINT epoch when the storage layer is a NoSQL-flavoured warehouse that does not have first-class TIMESTAMPTZ, or when the row will be replicated across dialects.

Time-zone traps in five bullets.

UTC vs local vs app TZ. Always store UTC; convert at the application edge; do not let any layer in the middle "helpfully" reinterpret.
DST 23/25-hour days. Daylight-saving transitions produce 23-hour and 25-hour calendar days. INTERVAL '1 day' follows the calendar; INTERVAL '24 hours' follows the clock. Choose deliberately.
Client-side TZ inference is fragile. A JavaScript new Date() reads the OS time zone. Two users in two cities looking at the same record see two different rendered times. Always render server-side or pass an explicit zone.
AT TIME ZONE. The SQL-standard operator: event_ts AT TIME ZONE 'UTC' converts a TIMESTAMP to a moment; event_ts AT TIME ZONE 'America/New_York' converts a TIMESTAMPTZ back to a local wall-clock for display.
Leap seconds. Most databases (including Postgres) treat the day as exactly 86,400 seconds and "swallow" leap seconds by smearing or repeating. Almost never matters; if it does, you already know.

Worked example — design an events table that survives DST and multi-region reads

Detailed explanation. Two teams in two cities log to the same events table; users in three time zones read dashboards. The only sustainable design is: store UTC in a TIMESTAMPTZ, present in the user's preferred zone at render time. The demo below shows what the same row looks like under New York, London, and Tokyo presentations.

Question. A user in America/New_York posts a comment at "9:00 AM local" on 2026-06-02. Write an INSERT that stores the moment correctly in TIMESTAMPTZ, then write three SELECTs that render the same row as a wall-clock in New York, London, and Tokyo.

Input.

User locale	Wall-clock	UTC equivalent
America/New_York (EDT, UTC-4)	2026-06-02 09:00:00	2026-06-02 13:00:00

Code.

CREATE TABLE comments (
    comment_id  BIGINT       GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body        TEXT         NOT NULL,
    posted_at   TIMESTAMPTZ  NOT NULL DEFAULT now()
);

-- Insert as a literal in the user's zone — the DB converts to UTC
INSERT INTO comments (body, posted_at)
VALUES ('Hello from NYC',
        TIMESTAMP '2026-06-02 09:00:00' AT TIME ZONE 'America/New_York');

-- Read in three different zones
SELECT body,
       posted_at AT TIME ZONE 'America/New_York' AS nyc_wall_clock,
       posted_at AT TIME ZONE 'Europe/London'    AS london_wall_clock,
       posted_at AT TIME ZONE 'Asia/Tokyo'       AS tokyo_wall_clock
FROM comments;

Step-by-step explanation.

Insert. The literal 2026-06-02 09:00:00 is interpreted in the America/New_York zone via AT TIME ZONE. Postgres converts to UTC (subtract 4 hours during EDT) and stores 2026-06-02 13:00:00+00.
Storage. On disk the value is 8 bytes representing the UTC moment. The original "9am NYC" framing is not stored — the moment is.
Read in NYC. AT TIME ZONE 'America/New_York' converts UTC back to local wall-clock. Reader sees 2026-06-02 09:00:00.
Read in London. Same UTC moment, different rendering. EDT is UTC-4; BST is UTC+1; so the London wall-clock is 2026-06-02 14:00:00.
Read in Tokyo. JST is UTC+9; the Tokyo wall-clock is 2026-06-02 22:00:00.
DST robustness. Six months later (December) the same query in NYC will render 4:00 AM-ish if the same UTC instant is replayed — because EST is UTC-5, not UTC-4. The database doesn't care; the application reasoning stays the same.

Output.

body	nyc_wall_clock	london_wall_clock	tokyo_wall_clock
Hello from NYC	2026-06-02 09:00:00	2026-06-02 14:00:00	2026-06-02 22:00:00

Rule of thumb. TIMESTAMPTZ for every event-time column. Reads use AT TIME ZONE '<user-preferred-zone>'. Never store local time without an explicit zone column — and even then, prefer TIMESTAMPTZ and present locally.

Worked example — INTERVAL arithmetic vs raw seconds across DST

Detailed explanation. A scheduled job needs to add "one day" to a TIMESTAMPTZ. Across the spring-forward DST transition, INTERVAL '1 day' advances the calendar but only ~23 hours of wall-clock time; INTERVAL '24 hours' advances exactly 24 hours but lands on a different calendar day. Pick the wrong one and the job either runs an hour early, an hour late, or doubles up on the transition day.

Question. Given t = 2026-03-08 01:30:00-05 (Saturday night, just before US EDT begins at 2:00 AM on Sunday), compute t + INTERVAL '1 day' and t + INTERVAL '24 hours' and show how they differ.

Input.

Variable	Value
t	2026-03-08 01:30:00-05 (EST, UTC-5)
DST transition	2026-03-08 02:00:00 EST → 03:00:00 EDT

Code (Postgres).

WITH base AS (
    SELECT TIMESTAMPTZ '2026-03-08 01:30:00 America/New_York' AS t
)
SELECT
    t                                  AS original,
    t + INTERVAL '1 day'               AS plus_one_day,
    t + INTERVAL '24 hours'            AS plus_24_hours,
    (t + INTERVAL '1 day')   AT TIME ZONE 'America/New_York' AS plus_one_day_nyc,
    (t + INTERVAL '24 hours') AT TIME ZONE 'America/New_York' AS plus_24h_nyc
FROM base;

Step-by-step explanation.

Original moment. 2026-03-08 01:30:00 America/New_York is EST (UTC-5), so the underlying UTC instant is 2026-03-08 06:30:00+00.
+ INTERVAL '1 day'. Postgres advances the calendar day by one. The result is 2026-03-09 01:30:00 America/New_York, which in EDT (UTC-4) is the UTC instant 2026-03-09 05:30:00+00 — only 23 hours of elapsed time.
+ INTERVAL '24 hours'. Postgres advances 24 hours of wall-clock elapsed time. The UTC instant becomes 2026-03-09 06:30:00+00. Rendered in New York that is 02:30:00 EDT — 25 wall-clock-minutes later on the calendar than the "1 day" form.
Why does it matter? A subscription that renews at "the same wall-clock time tomorrow" should use INTERVAL '1 day'. A token that expires "24 hours from issue" should use INTERVAL '24 hours'. They are genuinely different durations across DST.
The wrong default. Storing a duration as BIGINT seconds and adding seconds always advances 24h × N elapsed seconds. Useful for token expiry; wrong for "next renewal in the user's calendar."

Output.

Expression	UTC instant	New York wall-clock
original	2026-03-08 06:30:00+00	2026-03-08 01:30:00 EST
+ INTERVAL '1 day'	2026-03-09 05:30:00+00	2026-03-09 01:30:00 EDT
+ INTERVAL '24 hours'	2026-03-09 06:30:00+00	2026-03-09 02:30:00 EDT

Rule of thumb. "Same time tomorrow in the user's calendar" → INTERVAL '1 day'. "Twenty-four hours from now" → INTERVAL '24 hours' or BIGINT seconds. Mixing the two across DST produces silent off-by-one-hour bugs.

SQL interview question on DATE vs TIMESTAMP vs TIMESTAMPTZ

A common probe in a senior schema-design round sounds like: "We have a subscriptions table. Should started_at be DATE, TIMESTAMP, or TIMESTAMPTZ? What if the product needs to send a renewal email on the anniversary?" It tests whether the candidate aligns the column type with the semantics the product cares about.

Solution Using TIMESTAMPTZ for the moment plus a generated DATE for the anniversary

CREATE TABLE subscriptions (
    subscription_id BIGINT       GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id         BIGINT       NOT NULL,
    started_at      TIMESTAMPTZ  NOT NULL,
    user_tz         TEXT         NOT NULL,
    -- Anniversary day in the user's home timezone, computed from started_at
    anniversary     DATE         GENERATED ALWAYS AS
                    ((started_at AT TIME ZONE user_tz)::date) STORED
);

INSERT INTO subscriptions (user_id, started_at, user_tz)
VALUES
(101, TIMESTAMP '2025-06-02 09:00:00' AT TIME ZONE 'America/New_York', 'America/New_York'),
(102, TIMESTAMP '2025-06-02 23:30:00' AT TIME ZONE 'Asia/Tokyo',       'Asia/Tokyo');

Step-by-step trace.

Row	started_at (UTC)	user_tz	(started_at AT TIME ZONE user_tz)::date	anniversary
1	2025-06-02 13:00:00+00	America/New_York	2025-06-02 09:00:00	2025-06-02
2	2025-06-02 14:30:00+00	Asia/Tokyo	2025-06-02 23:30:00	2025-06-02

Output:

subscription_id	user_id	started_at	user_tz	anniversary
1	101	2025-06-02 13:00:00+00	America/New_York	2025-06-02
2	102	2025-06-02 14:30:00+00	Asia/Tokyo	2025-06-02

Why this works — concept by concept:

TIMESTAMPTZ for the moment — the UTC instant is unambiguous; it survives every server move, every replica, every region.
user_tz as a sibling column — the database keeps both "when did it happen?" (UTC) and "in what zone did the user mean it?" — the only durable representation of intent across DST shifts and migrations.
Generated DATE column — GENERATED ALWAYS AS … STORED computes the anniversary once per row, indexes cleanly, and stays consistent with started_at automatically. The renewal job filters on WHERE anniversary = current_date and ships emails on the right day for every user.
No app-side timezone math — the conversion logic lives in the database, where it can be reasoned about and tested. Application code reads the already-shaped column.
Cost — generated columns add ~4 B per row; the saving in cross-team correctness debt is enormous.

SQL
Topic — date functions
Date and time problems (SQL)

Practice →

5. Semi-structured types — JSON, JSONB, ARRAY, STRUCT

`JSONB`, `ARRAY`, and `STRUCT` are the bridge between rigid relational columns and "we don't know the shape yet" data

The mental model in one line: semi-structured types let one column hold a nested document, an ordered list, or a typed record — at the cost of slightly slower writes and a different indexing story. They are the right tool when the schema is genuinely evolving; they are the wrong tool when a field is accessed often, in which case it should be promoted to a top-level column.

JSON vs JSONB in Postgres.

JSON stores the value as text, parses on every read. Faithful to whitespace and key order; slow to query.
JSONB stores the value as a binary tree, parsed once on write. Indexable with GIN, 5-10× faster on read; loses whitespace and may reorder keys.
Production default: JSONB unless you have a specific reason to preserve the exact text (audit trails, contract integrity, signature verification).
MySQL has only JSON, which is binary internally (closer to Postgres' JSONB). MariaDB JSON is text.

VARIANT in Snowflake.

A single type that holds any JSON-like value. Snowflake automatically extracts column shards as it ingests, which lets the optimizer prune unused paths.
Query with column:key dot notation. Returns VARIANT; cast explicitly (::int, ::string) before joining or aggregating.

ARRAY — homogeneous, ordered, indexable in some dialects.

Postgres — first-class ARRAY for any type, e.g. INT[], TEXT[], JSONB[]. Supports unnest(arr) to explode into rows, arr @> ARRAY['x'] for membership, GIN indexing for fast containment queries.
BigQuery — ARRAY<T> is first-class; UNNEST(arr) is the canonical explode operator.
Snowflake — ARRAY is an alias for VARIANT holding an ordered list; LATERAL FLATTEN(input => arr) to explode.
Spark SQL — ArrayType(T), explode with explode(arr) or LATERAL VIEW EXPLODE(arr) t AS elem.

STRUCT / RECORD / MAP — typed nested records.

BigQuery — STRUCT<name STRING, age INT64> for nested typed records; dot-access t.user.name. Pairs well with ARRAY<STRUCT<…>>.
Snowflake — OBJECT is the analogous container; OBJECT_CONSTRUCT(...) to build.
Spark SQL — StructType(StructField('name', StringType()), …); dot-access df.user.name.
Postgres — composite types via CREATE TYPE; less common in practice than JSONB.

UNNEST patterns — the canonical explode operator.

SELECT unnest(tags) FROM posts (Postgres) → one row per array element.
SELECT tag FROM posts, UNNEST(tags) AS tag (BigQuery) → one row per (post, tag) pair.
SELECT t.value FROM posts, LATERAL FLATTEN(input => posts.tags) t (Snowflake) → same idea, slightly different syntax.

Indexing semi-structured data.

Postgres GIN on JSONB — supports @> containment, ? key existence, ?| / ?& multi-key existence. Beats every B-tree index on JSON access patterns.
Postgres GIN on ARRAY — accelerates @> and && membership queries on text or integer arrays.
BigQuery JSON dot-notation — implicit indexing via column-pruning at the storage layer; no explicit index.
Snowflake VARIANT — automatic per-path micro-clustering inside the column; no explicit index.

When to use semi-structured vs proper relational columns.

Use semi-structured when the schema genuinely evolves (per-feature settings, third-party webhook payloads, telemetry where new fields arrive continuously) — and most fields are written and ignored.
Promote to columns when a field is accessed often. The schema-on-read instinct ("just dump it all into JSONB") is a trap — every access pays a parse cost, and the optimizer cannot use column-level stats.
Rule of thumb: if a key in JSONB is read more than 5% of the time, promote it to a top-level column with a NOT NULL constraint or an index.

Worked example — store and query a flexible product-attributes payload

Detailed explanation. A product catalog has stable columns (sku, name, price) plus a long-tail of category-specific attributes (size for clothing, watts for appliances, megapixels for cameras). Modeling every possible attribute as a column produces a sparse, ever-growing schema. The right tool is JSONB with a GIN index — the stable columns stay relational; the variable attributes live in one indexable document column.

Question. Design a products table that mixes stable columns with a flexible attribute document. Then write three queries: find all clothing in size M, find any product where the brand is Nike, and count products per brand.

Input (sample documents).

sku	name	attributes (JSONB)
SKU001	T-Shirt	`{"category":"clothing","size":"M","brand":"Acme","color":"blue"}`
SKU002	Running shoes	`{"category":"footwear","size":"42","brand":"Nike","color":"black"}`
SKU003	Microwave	`{"category":"appliance","watts":900,"brand":"PanaPro","color":"white"}`
SKU004	T-Shirt	`{"category":"clothing","size":"M","brand":"Nike","color":"red"}`

Code (Postgres).

CREATE TABLE products (
    product_id BIGINT      GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sku        CHAR(8)     NOT NULL UNIQUE,
    name       TEXT        NOT NULL,
    price_cents BIGINT     NOT NULL CHECK (price_cents >= 0),
    attributes JSONB       NOT NULL
);

CREATE INDEX products_attrs_gin_idx
    ON products USING GIN (attributes jsonb_path_ops);

-- All clothing in size M
SELECT sku, name
FROM products
WHERE attributes @> '{"category":"clothing","size":"M"}';

-- Any product where the brand is Nike (key access)
SELECT sku, name, attributes->>'brand' AS brand
FROM products
WHERE attributes->>'brand' = 'Nike';

-- Count products per brand
SELECT attributes->>'brand' AS brand, COUNT(*) AS n
FROM products
GROUP BY attributes->>'brand'
ORDER BY n DESC;

Step-by-step explanation.

JSONB storage. Each attributes value is parsed on write into Postgres' binary JSON representation. Whitespace is dropped; keys may be reordered; the result fits in roughly the same disk space as the text payload.
GIN index on jsonb_path_ops. Indexes every key/value path inside the document. attributes @> '{"key":"value"}' becomes an index probe rather than a sequential scan; query cost drops from O(N) to O(log N).
Containment query (@>). attributes @> '{"category":"clothing","size":"M"}' matches every row whose attributes document contains both ("category","clothing") and ("size","M"). Returns SKU001 and SKU004.
Key access (->>). attributes->>'brand' returns the brand as text. WHERE attributes->>'brand' = 'Nike' uses an expression index if one exists; otherwise falls back to a sequential scan. For frequent access, consider an expression index CREATE INDEX … ON products ((attributes->>'brand')).
Group by JSON key. Same ->> operator. Postgres groups by the extracted text. Returns one row per distinct brand with counts.

Output (query 1 — clothing size M).

sku	name
SKU001	T-Shirt
SKU004	T-Shirt

Output (query 2 — brand Nike).

sku	name	brand
SKU002	Running shoes	Nike
SKU004	T-Shirt	Nike

Output (query 3 — count per brand).

brand	n
Nike	2
Acme	1
PanaPro	1

Rule of thumb. When attribute frequency exceeds ~5% across the dataset, promote that key to a top-level column with a real index. When it stays sparse, leave it in JSONB and let GIN do the work.

Worked example — STRUCT vs JSON in BigQuery for a typed event payload

Detailed explanation. A click event has a stable shape: user_id, event_ts, plus a nested device block with os, browser, and screen_width. In BigQuery, two natural choices are a STRUCT<> column (typed nested record) or a JSON column (untyped). The STRUCT form is typed at write time, queryable via dot notation, and compatible with BigQuery's columnar scan; the JSON form is more flexible but forces parsing on every read.

Question. Compare a BigQuery STRUCT<> event table to a JSON event table for a clickstream payload. Show the schema, an INSERT, and the two equivalent queries.

Input (one sample event).

Field	Value
user_id	42
event_ts	2026-06-02 12:00:00 UTC
device.os	iOS
device.browser	Safari
device.screen_width	390

Code (BigQuery, illustrative).

-- Schema A — typed STRUCT
CREATE TABLE events_struct (
    user_id  INT64,
    event_ts TIMESTAMP,
    device   STRUCT<os STRING, browser STRING, screen_width INT64>
);

INSERT events_struct VALUES
    (42, TIMESTAMP '2026-06-02 12:00:00 UTC',
     STRUCT('iOS' AS os, 'Safari' AS browser, 390 AS screen_width));

-- Query A — dot access, columnar scan
SELECT user_id, device.os, COUNT(*) AS n
FROM events_struct
GROUP BY user_id, device.os;

-- Schema B — JSON
CREATE TABLE events_json (
    user_id  INT64,
    event_ts TIMESTAMP,
    device   JSON
);

INSERT events_json VALUES
    (42, TIMESTAMP '2026-06-02 12:00:00 UTC',
     JSON '{"os":"iOS","browser":"Safari","screen_width":390}');

-- Query B — JSON extraction
SELECT user_id, JSON_VALUE(device, '$.os') AS os, COUNT(*) AS n
FROM events_json
GROUP BY user_id, os;

Step-by-step explanation.

STRUCT schema. BigQuery stores device.os, device.browser, device.screen_width as separate columns under the hood — each is statistically tracked, columnar-scanned, and individually compressible.
JSON schema. The whole payload sits in one cell as a binary JSON value. Reads pay a parse cost per row; the optimizer cannot use per-key statistics.
Query cost (STRUCT). Reading device.os touches only the device.os shard. Bytes scanned drops to the size of that one nested column — typically 1-2 bytes per row after compression.
Query cost (JSON). Reading JSON_VALUE(device, '$.os') parses the whole device payload for every row. Bytes scanned is the full JSON column — usually 5-10× the STRUCT cost.
When to use which. Use STRUCT when the shape is stable and known up front. Use JSON when the shape evolves continuously or when third-party payloads land with unpredictable keys. Promote frequent JSON keys to STRUCT fields when query volume justifies it.

Output.

Schema	Bytes scanned (per query)	Type safety	Schema evolution
STRUCT	small (only `device.os` column)	strong (typed at write)	requires `ALTER` to add a key
JSON	large (whole `device` column)	weak (parse on read)	free — any key any time

Rule of thumb. STRUCT for stable, queryable nested shapes; JSON for genuinely evolving payloads. Promote stable keys out of JSON into typed columns as soon as they earn the query traffic.

SQL interview question on ARRAY columns vs join tables

A common probe in a modeling round sounds like: "When would you use a TEXT[] column for tags instead of a separate post_tags join table?" The interviewer is testing whether the candidate balances ergonomics (one row, one column) against the cost of indexing and query patterns.

Solution Using `TEXT[]` plus a GIN index for fast membership

CREATE TABLE posts (
    post_id BIGINT  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title   TEXT    NOT NULL,
    tags    TEXT[]  NOT NULL DEFAULT '{}'
);

CREATE INDEX posts_tags_gin_idx
    ON posts USING GIN (tags);

INSERT INTO posts (title, tags) VALUES
('Intro to SQL data types', ARRAY['sql','data-types','postgres']),
('Stream processing 101',   ARRAY['streaming','kafka','flink']),
('Schema design notes',     ARRAY['sql','schema','data-types']);

-- All posts tagged 'sql'
SELECT post_id, title
FROM posts
WHERE tags @> ARRAY['sql'];

-- Explode tags into one row per (post, tag) pair
SELECT post_id, title, unnest(tags) AS tag
FROM posts;

Step-by-step trace.

Step	Operator	What it does
1	`TEXT[]` column	stores all tags for one post in one row; no join table required for the simple case
2	`GIN ON tags`	index every distinct tag value; `@>` queries are index-backed
3	`tags @> ARRAY['sql']`	"tags contains 'sql'" — returns posts 1 and 3
4	`unnest(tags)`	expands the array into one row per element; lets you join, aggregate, or window across tags
5	Trade-off	very fast for read-heavy "list / filter by tag"; less natural for "set the tag's display name" or "rename a tag everywhere" — those still need a tag dimension table

Output (tags @> ARRAY['sql']).

post_id	title
1	Intro to SQL data types
3	Schema design notes

Why this works — concept by concept:

ARRAY column + GIN — the array is stored inline with the row; the GIN index serves all @> and && membership queries in O(log N).
unnest — the canonical explode operator turns an array column into a vertical relation. Useful for GROUP BY tag style queries without a separate join table.
When a join table is still right — when each tag has its own metadata (display name, hex color, parent-tag), promote tags to a tags dimension table and a post_tags(post_id, tag_id) join table. The array column is the read-optimized shortcut; the join table is the canonical representation.
JSONB alternative — same idea, more flexibility; use when the tag entries are objects ({"name":"sql","weight":0.8}) rather than scalars.
Cost — write = O(K) per row (K tags); read = O(log N) on membership; storage = ~25 B per row + tag bytes.

SQL
Topic — JSON
JSON column problems (SQL)

Practice →

Choosing the right type by domain (cheat sheet)

Money → NUMERIC(19, 4) (or BIGINT cents when sub-cent precision isn't needed). Never FLOAT or DOUBLE. The 0.1 + 0.2 = 0.30000000000000004 demo is your reference for why.
Surrogate primary keys → BIGINT GENERATED ALWAYS AS IDENTITY. For distributed systems where multiple writers mint IDs, UUID (16 bytes; gen_random_uuid() in Postgres).
Foreign keys → match the parent column exactly. Mixing INT parent + BIGINT child is a common cause of slow joins.
Timestamps → TIMESTAMPTZ for every event-time. DATE for calendar-day attributes. BIGINT epoch only when the storage layer doesn't have a real time-zone type.
Short codes (currency, country, US state) → CHAR(N) where N is the exact width.
Names and free-form text → TEXT in Postgres; VARCHAR(255) (or sized for the longest realistic value) in MySQL.
Booleans → native BOOLEAN everywhere except SQL Server (BIT). In MySQL, declare BOOLEAN; it maps to TINYINT(1) but reads cleanly across tools.
Tags / multi-value attributes → TEXT[] (Postgres) or ARRAY<STRING> (BigQuery / Snowflake). Avoid the "comma-separated string" anti-pattern.
Sparse / evolving attributes → JSONB (Postgres) / JSON (MySQL) / VARIANT (Snowflake) / JSON (BigQuery). Add a GIN index in Postgres when access is frequent.
Typed nested records → STRUCT in BigQuery / Snowflake / Spark; composite types or JSONB in Postgres.
Large binary blobs → BYTEA (Postgres), BLOB (MySQL), VARBINARY(MAX) (SQL Server). For anything over a few MB, store in object storage and keep only a pointer in the database.

Frequently asked questions

Is VARCHAR(255) faster than VARCHAR(1000) in Postgres?

No. In Postgres, VARCHAR(N) and TEXT share the same storage path; the declared length is only a constraint enforced on INSERT and UPDATE. Smaller values use less disk regardless of the declared maximum. The "VARCHAR(255) is faster" rule of thumb originates from MySQL, where the declared length contributes to the row-width budget (capped at 65,535 bytes); shorter VARCHARs in MySQL leave room for more columns. In Postgres, prefer TEXT for unbounded strings and VARCHAR(N) only when N encodes a real domain constraint.

Should I use NUMERIC or FLOAT for currency?

Always NUMERIC (or DECIMAL, which is the same type in most dialects). FLOAT and DOUBLE PRECISION are IEEE-754, which cannot exactly represent decimal fractions like 0.1. Sums drift, audits fail, and the bugs surface only after enough volume. Use NUMERIC(19, 4) as a sensible default for retail money (15 digits before the decimal, 4 after). For very high precision (high-frequency trading), push the scale to NUMERIC(38, 8) or store integer-millicents in a BIGINT and format on display. The slightly slower comparisons are a small price for exact arithmetic.

TIMESTAMP vs TIMESTAMPTZ — which should I use?

TIMESTAMPTZ (TIMESTAMP WITH TIME ZONE) for almost every event-time column in production. Postgres stores it internally as UTC and renders it in the session time zone on read, which means the same row reads consistently across regions and time-zone-aware joins behave correctly. TIMESTAMP (without zone) is appropriate only when the value is genuinely zoneless — recurring schedules expressed in local hours, business-day cutoffs, working hours — where the meaning is "9am wherever the user is." Note that MySQL's TIMESTAMP already behaves like a TZ-aware column (it converts to UTC on write), while its DATETIME is the literal zoneless type — exactly opposite to Postgres' naming.

When should I use JSONB instead of normalised tables?

Reach for JSONB (or its dialect equivalent — VARIANT in Snowflake, JSON in BigQuery) when the schema is genuinely evolving and most fields are written-and-ignored: per-feature settings, third-party webhook payloads, telemetry where new keys arrive continuously. Stick to normalised columns when fields are accessed often, have referential integrity (foreign keys), or need column-level statistics for the planner. A useful rule of thumb is: if a key inside the document is read in more than ~5% of queries, promote it to a top-level column with an explicit type and a real index. JSONB is a feature, not the default — used everywhere it turns into a slow, opaque schema-on-read pile.

INT vs BIGINT for primary keys — does it really matter?

For surrogate primary keys, default to BIGINT. A signed 32-bit INT tops out at ~2.1 billion, which is a single quarter of traffic on a busy event table. Migrating a primary key from INT to BIGINT on a billion-row production table is an offline operation in most databases — far more expensive than the extra 4 bytes per row. For small lookups (a list of currencies, US states, weekday names), INT or even SMALLINT is fine; for any table that might grow without bound, start at BIGINT. The 8-byte width also matches the platform's native integer register on every modern CPU, so the planner's comparisons stay cheap.

How do data types affect query plans?

Three ways. First, the planner uses column statistics — distinct values, histogram, MCVs — to estimate predicates, and those stats are typed: a TEXT column's stats are different from a JSONB key's stats. Mistyped columns get rougher estimates and worse plans. Second, operator dispatch is type-driven — WHERE id = '42' against an INT column triggers an implicit cast that can disable an index in some dialects. Third, fixed-width types live at predictable offsets in the row, which lets the engine skip directly to a column during a sequential scan; variable-width types require length-prefix arithmetic. The net: matching column types to the domain produces tighter plans, faster scans, and fewer surprises in EXPLAIN.

Practice on PipeCode

Drill the SQL practice library → for end-to-end type-driven schema, joins, and aggregation problems.
Sharpen type-casting problems → for CAST, ::, and implicit coercion drills.
Practice date-functions problems → for TIMESTAMPTZ, AT TIME ZONE, EXTRACT, and DST traps.
Work through JSON problems → when the interviewer wants JSONB operators and GIN access patterns.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
For a dialect-deep refresher, see SQL data types in PostgreSQL →.
For end-to-end schema design, work through data modelling for DE interviews →.
For the SQL muscle that pairs with type design, take SQL for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every type decision above ships with hands-on practice rooms where you write real `NUMERIC`, `JSONB`, and `TIMESTAMPTZ` queries against realistic schemas. Start with the SQL library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice SQL now →
Type-casting drills →

1. Why data types decide query speed and storage cost

Types are the only contract the database, the planner, and the application all share

Worked example — type-driven storage cost on a billion-row event table

SQL interview question on type selection at scale

Solution Using narrow exact types plus explicit constraints

Worked example — the implicit-cast trap that kills index usage

2. Numeric types — INT, NUMERIC, FLOAT, DOUBLE

Numeric columns split into two families — exact integers and approximate floats — and a third (NUMERIC) that is "exact, slow, and unbounded"

Worked example — pick the right numeric type for a payments ledger

Worked example — SMALLINT for a status enum saves real bytes at scale

SQL interview question on choosing the integer width

Solution Using BIGINT identity with explicit overflow guard

3. String types — CHAR, VARCHAR, TEXT, encoding

Strings split into fixed-width, length-bounded, and unbounded — pick by predictability, not by habit

Worked example — pick string types for a user-profile table

Worked example — MySQL silent truncation vs Postgres strict failure

SQL interview question on string-type pitfalls

Solution Using TEXT plus an application-side validator and a CHECK constraint

4. Date and time types — DATE, TIME, TIMESTAMP, TIMESTAMPTZ, INTERVAL

data datatype in sql for time is five distinct primitives — pick by what the column represents, not what the wire format looks like

Worked example — design an events table that survives DST and multi-region reads

Worked example — INTERVAL arithmetic vs raw seconds across DST

SQL interview question on DATE vs TIMESTAMP vs TIMESTAMPTZ

Solution Using TIMESTAMPTZ for the moment plus a generated DATE for the anniversary

5. Semi-structured types — JSON, JSONB, ARRAY, STRUCT

JSONB, ARRAY, and STRUCT are the bridge between rigid relational columns and "we don't know the shape yet" data

Worked example — store and query a flexible product-attributes payload

Worked example — STRUCT vs JSON in BigQuery for a typed event payload

SQL interview question on ARRAY columns vs join tables

Solution Using TEXT[] plus a GIN index for fast membership

Choosing the right type by domain (cheat sheet)

Frequently asked questions

Is VARCHAR(255) faster than VARCHAR(1000) in Postgres?

Should I use NUMERIC or FLOAT for currency?

TIMESTAMP vs TIMESTAMPTZ — which should I use?

When should I use JSONB instead of normalised tables?

INT vs BIGINT for primary keys — does it really matter?

How do data types affect query plans?

Practice on PipeCode

Solution Using `TEXT` plus an application-side validator and a CHECK constraint

`data datatype in sql` for time is five distinct primitives — pick by what the column represents, not what the wire format looks like

`JSONB`, `ARRAY`, and `STRUCT` are the bridge between rigid relational columns and "we don't know the shape yet" data

Solution Using `TEXT[]` plus a GIN index for fast membership