sql string functions split into five families that every data engineering pipeline eventually touches: concatenation (concat sql — CONCAT, ||, +), slicing (substring sql — SUBSTRING, SUBSTR, LEFT, RIGHT, SPLIT_PART), replacement (sql replace — REPLACE, TRANSLATE, REGEXP_REPLACE), cleaning (sql trim — TRIM, LTRIM, RTRIM, BTRIM), and pattern matching (regexp in sql — ~, REGEXP_LIKE, LIKE, ILIKE). Add delimiter parsing with STRING_AGG / GROUP_CONCAT / STRING_SPLIT and you can answer almost every text-normalization sql interview questions prompt in a panel — from "build a full name column" to "extract the domain from an email" to "flag rows that fail a regex validation rule."
This guide walks through every clause in the string-function family that reviewers test in data engineering interview questions: the VARCHAR / TEXT / CHAR type hierarchy and NULL propagation rules, the dialect jungle around concat sql (CONCAT null-safe vs || vs SQL Server +), substring sql indexing (1-based everywhere except a few legacy APIs), sql replace vs TRANSLATE vs REGEXP_REPLACE for normalization, sql trim variants (BOTH / LEADING / TRAILING and custom trim characters), and regexp in sql vs LIKE for validation and extraction — the patterns every sql for data engineers loop tests. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown.
When you want hands-on reps immediately after reading, browse string-functions SQL practice →, drill the string-manipulation lane →, sharpen string-parsing drills →, rehearse string-concatenation patterns →, or widen coverage on the full SQL practice library →.
On this page
- Why SQL string functions matter in data engineering interviews
- SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior
- CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server
- SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART
- REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns
- TRIM, LTRIM, RTRIM — cleaning dirty text columns
- REGEXP, LIKE, and pattern matching — validation vs extraction
- Common string interview patterns — parsing, aggregation, validation
- Choosing the right string function (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why SQL string functions matter in data engineering interviews
The five string function families — that's the whole interview surface
The one-sentence invariant: sql string functions split into five families — concatenation (CONCAT, ||), slicing (SUBSTRING, LEFT, RIGHT), replacement (REPLACE, TRANSLATE, REGEXP_REPLACE), cleaning (TRIM, LTRIM, RTRIM), and pattern matching (regexp in sql, LIKE, ILIKE) — plus aggregation/parsing with STRING_AGG and SPLIT_PART. Once you internalise that taxonomy, every text-related interview prompt collapses to "pick the right function from the right family."
The five function families.
-
Concatenation —
CONCAT(a, b, c),a || b(Postgres / Oracle),CONCAT(a, b)(MySQL / SQL Server 2012+), legacya + b(SQL Server with NULL trap). -
Slicing —
SUBSTRING(s FROM start FOR len),SUBSTR(s, start, len),LEFT(s, n),RIGHT(s, n),SPLIT_PART(s, delim, n). -
Replacement —
REPLACE(s, old, new),TRANSLATE(s, from, to)(char-by-char map),REGEXP_REPLACE(s, pattern, replacement). -
Cleaning —
TRIM(BOTH ' ' FROM s),LTRIM(s),RTRIM(s),BTRIM(s, chars). -
Pattern matching —
s ~ 'regex'(Postgres),REGEXP_LIKE(s, pattern)(MySQL / Oracle),s LIKE '%pattern%',s ILIKE '%pattern%'(Postgres case-insensitive).
Why interviewers love string functions.
- Raw data is messy — leading spaces, mixed casing, inconsistent delimiters, embedded newlines, phone numbers stored ten different ways.
-
Dialect divergence is real —
concat sqlnull-handling differs across engines;substring sqlstart positions are 1-based in SQL but easy to confuse with 0-based languages. -
Regex support varies — Postgres uses POSIX regex; MySQL has its own engine; SQL Server leans on
LIKE+PATINDEXunless you enable CLR regex. -
Performance traps are common — wrapping an indexed column in
LOWER(email)defeats the index; functional indexes or normalized storage columns are the fix.
What interviewers listen for.
- Do you name the family for the function you're picking? — "this is a cleaning step, so I'll reach for
TRIMbeforeREPLACE." - Do you mention NULL propagation when concatenating? — senior signal.
- Do you use
REGEXP_REPLACEonly when literalREPLACEwon't scale? — shows you know the cost trade-off. - Do you flag 1-based SUBSTRING indexing aloud? — avoids off-by-one bugs in live coding.
Worked example — a single query that uses all five families
Question. From contacts(contact_id, first_name, last_name, email_raw), return a trimmed full name, a lowercased email with @company.com replaced by @corp.com, and the email domain extracted via regex.
Input. contacts slice.
| contact_id | first_name | last_name | email_raw |
|---|---|---|---|
| 1 | Alice | Smith | ALICE@COMPANY.COM |
| 2 | Bob | NULL | bob@test.org |
| 3 | Carol | Lee | carol@company.com |
Code (PostgreSQL).
SELECT contact_id,
TRIM(CONCAT(TRIM(first_name), ' ', TRIM(COALESCE(last_name, '')))) AS full_name, -- cleaning + concat
LOWER(REPLACE(TRIM(email_raw), '@company.com', '@corp.com')) AS normalized_email, -- cleaning + replace
SUBSTRING(TRIM(email_raw) FROM '@(.+)$') AS domain_via_substring, -- slicing (regex capture)
(REGEXP_MATCH(TRIM(email_raw), '@(.+)$'))[1] AS domain_via_regexp -- pattern matching
FROM contacts;
Step-by-step explanation.
-
TRIM(first_name)andTRIM(last_name)strip leading/trailing spaces (cleaning family). -
CONCAT(..., ' ', ...)joins first and last name;COALESCE(last_name, '')prevents NULL from nulling the whole expression in||contexts. -
REPLACE(TRIM(email_raw), '@company.com', '@corp.com')swaps the domain literal (replacement family). -
SUBSTRING(... FROM '@(.+)$')extracts the substring matching the POSIX regex capture (slicing + pattern). -
REGEXP_MATCH(...)[1]pulls capture group 1 as an array element (pattern matching family).
Output.
| contact_id | full_name | normalized_email | domain_via_substring | domain_via_regexp |
|---|---|---|---|---|
| 1 | Alice Smith | alice@corp.com | company.com | company.com |
| 2 | Bob | bob@test.org | test.org | test.org |
| 3 | Carol Lee | carol@corp.com | company.com | company.com |
Rule of thumb: identify the family first, then pick the dialect's function — a far more reliable mental model than memorising every function signature.
SQL
Topic — string-functions
String functions SQL drills
SQL
Topic — string-manipulation
String manipulation SQL practice
2. SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior
Pick the string type that matches the column — half the bugs are wrong-type bugs
Before any function call comes the question of which type the column should be. The string type family is small but the dialect names diverge — knowing when to use VARCHAR(n) vs TEXT vs fixed CHAR(n) is what separates candidates who've shipped production pipelines from candidates who've only read tutorials.
CHAR(n) — fixed-length, space-padded.
-
Stores — exactly
ncharacters; shorter values are right-padded with spaces. - Use it for — legacy fixed-width codes (country codes, status flags) where length is invariant.
-
The gotcha —
'US'stored inCHAR(10)compares unequal to'US'inVARCHARbecause of trailing spaces; alwaysTRIMbefore comparing.
VARCHAR(n) / NVARCHAR(n) — variable-length with optional cap.
-
Stores — up to
ncharacters without padding. -
MySQL —
VARCHAR(255)is the classic indexed-column choice; longer values spill to off-page storage. -
SQL Server —
VARCHAR(n)for ASCII;NVARCHAR(n)for Unicode (UTF-16);VARCHAR(MAX)for large text. - Use it for — names, emails, SKUs, any text with a known reasonable max length.
TEXT / CLOB — unlimited (or very large) variable-length.
-
PostgreSQL —
TEXThas no length limit and performs identically toVARCHARwithout a length constraint; preferTEXTunless you need a CHECK constraint on length. -
MySQL —
TEXT,MEDIUMTEXT,LONGTEXTtiers with different max sizes. - Use it for — JSON payloads, log lines, free-form notes — columns you won't index directly.
BYTEA / BLOB — binary, not text.
-
Not a string type — don't apply
TRIMorCONCATto binary columns; decode first withCONVERT_FROM(bytes, 'UTF8')(Postgres) orCAST(... AS VARCHAR).
NULL propagation — the universal concatenation trap.
-
CONCAT(a, b)(Postgres / SQL Server 2012+) — NULL-safe; treats NULL arguments as empty strings;CONCAT('a', NULL, 'b')→'ab'. -
CONCAT(a, b)(MySQL) — if any argument is NULL, result is NULL; useCONCAT_WS()to skip NULLs. -
a || b(Postgres / Oracle) — NULL poisons the result;'a' || NULL || 'b'→ NULL. -
SQL Server
a + b— NULL poisons; useCONCAT()orCONCAT_WS()instead. -
Fix in Postgres with
||—COALESCE(a, '') || COALESCE(b, '')or use the built-inCONCAT()function (Postgres 9.1+).
Collation and case sensitivity.
-
Default collation determines whether
'ABC' = 'abc'; string comparisons in interviews usually assume case-sensitive unless stated. -
Postgres
ILIKE— case-insensitive pattern match; no ANSI equivalent. -
Normalize early — store
LOWER(email)in a generated column if you filter on it constantly.
Common type pitfalls.
- Storing numbers as
VARCHAR— sorts lexicographically ('10' < '2'); cast before arithmetic. - Comparing
CHARwithoutTRIM— silent mismatches on padded values. - Applying string functions to NULL — most return NULL; use
COALESCEat the boundary.
SQL
Topic — string-functions
String functions library
SQL
Topic — string-manipulation
String manipulation drills
3. CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server
concat sql is where NULL-handling diverges — know all three operators
The concat sql family is deceptively simple until NULL appears. PostgreSQL offers both CONCAT() (null-safe) and || (NULL-poisoning). MySQL's CONCAT() skips NULLs. SQL Server historically used + (NULL-poisoning) before CONCAT() arrived in 2012. Knowing all three is what separates candidates who've worked across engines.
PostgreSQL — CONCAT() vs ||.
-
CONCAT(a, b, c)— variadic, NULL-safe; skips NULL arguments. -
a || b || c— ANSI concatenation operator; any NULL → entire result NULL. -
CONCAT_WS(sep, a, b, c)— "concat with separator"; skips NULL parts and separator between them. -
Example —
CONCAT(first_name, ' ', last_name)vsfirst_name || ' ' || last_name(requiresCOALESCEon nullable columns).
MySQL — CONCAT() and CONCAT_WS().
-
CONCAT(str1, str2, ...)— if any argument is NULL, the entire result is NULL (stricter than PostgresCONCAT). -
CONCAT_WS(sep, str1, str2, ...)— skips NULL arguments and their separators; the MySQL fix for nullable name parts. -
GROUP_CONCAT(col SEPARATOR ',')— aggregate concatenation; the MySQL answer toSTRING_AGG.
SQL Server — CONCAT(), CONCAT_WS(), legacy +.
-
CONCAT(a, b, c)— SQL Server 2012+; NULL-safe (treats NULL as empty string). -
a + b— NULL-poisoning; legacy code still uses it. -
CONCAT_WS(sep, a, b)— SQL Server 2017+; separator between non-NULL values. -
STRING_AGG(col, ',')— aggregate concatenation (SQL Server 2017+).
Snowflake / BigQuery equivalents.
-
Snowflake —
CONCAT(a, b)ora || b(both NULL-safe withIFNULLpatterns);LISTAGG(col, ',')for aggregates. -
BigQuery —
CONCAT(a, b); NULL → NULL unless you wrap withIFNULL;STRING_AGG(col, ',')for aggregates.
When to reach for concatenation vs a template.
-
Simple join —
CONCAT(first, ' ', last). -
Conditional pieces —
CONCAT_WS(' ', NULLIF(middle, ''), last)skips blank middle names cleanly. -
Aggregate rollups —
STRING_AGG(tag, ', ' ORDER BY tag)(Postgres) instead of concatenating in application code.
SQL interview question — build a display name with optional middle initial
Assume employees(emp_id, first_name, middle_name, last_name). Return emp_id and a display_name formatted as Last, First M. — omit the middle initial entirely when middle_name is NULL or empty; trim all parts first.
Solution Using CONCAT_WS + NULLIF + TRIM
Code (PostgreSQL).
SELECT emp_id,
CONCAT_WS(', ',
TRIM(last_name),
CONCAT_WS(' ',
TRIM(first_name),
NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
)
) AS display_name
FROM employees;
Equivalent in MySQL.
SELECT emp_id,
CONCAT_WS(', ',
TRIM(last_name),
CONCAT_WS(' ',
TRIM(first_name),
NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
)
) AS display_name
FROM employees;
Equivalent in SQL Server.
SELECT emp_id,
CONCAT_WS(', ',
TRIM(last_name),
CONCAT_WS(' ',
TRIM(first_name),
NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
)
) AS display_name
FROM employees;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 |
TRIM each name part |
strips padding spaces from CHAR columns |
| 2 | LEFT(TRIM(middle_name), 1) |
first character of middle name |
| 3 |
CONCAT(..., '.') then NULLIF(..., '.')
|
produces 'M.' or NULL when middle is empty |
| 4 | inner CONCAT_WS(' ', first, middle_initial)
|
joins first name and optional initial; skips NULL middle |
| 5 | outer CONCAT_WS(', ', last, first_block)
|
produces 'Smith, Alice M.' format |
Output:
| emp_id | display_name |
|---|---|
| 1 | Smith, Alice M. |
| 2 | Jones, Bob |
| 3 | Lee, Carol A. |
Why this works — concept by concept:
-
TRIM before concat — fixed-width
CHARcolumns and CSV imports often carry trailing spaces that would break display formatting. -
LEFT + CONCAT for initial — extracts one character and appends a period; simpler than
SUBSTRING(middle_name FROM 1 FOR 1). -
NULLIF(..., '.') — when
middle_nameis NULL or empty,LEFTreturns empty string,CONCATyields'.', andNULLIFconverts that sentinel to NULL soCONCAT_WSskips it. -
CONCAT_WS skips NULLs — no manual
CASE WHEN middle_name IS NOT NULLwrapper needed; cleaner and dialect-portable. -
Cost — single projection over the table;
O(n)row scan with constant-time string ops per row.
SQL
Topic — string-concatenation
String concatenation drills
SQL
Topic — string-functions
String functions SQL practice
4. SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART
substring sql is 1-based everywhere — say that aloud in interviews
The slicing family extracts a portion of a string. Every major SQL dialect uses 1-based start positions (unlike Python's 0-based indexing). Negative start positions work in some dialects (Postgres SUBSTRING(s FROM -3) counts from the end). Off-by-one errors are the #1 bug in live coding — state "start is 1-based" before you write the query.
ANSI SUBSTRING(string FROM start [FOR length]).
-
Postgres / Oracle —
SUBSTRING('abcdef' FROM 2 FOR 3)→'bcd'. -
MySQL —
SUBSTRING('abcdef', 2, 3)orSUBSTR('abcdef', 2, 3)(function-call form). -
SQL Server —
SUBSTRING('abcdef', 2, 3)(same as MySQL). -
Omit length — returns from
startto end of string.
LEFT(s, n) and RIGHT(s, n) — convenience wrappers.
-
LEFT('abcdef', 3)→'abc';RIGHT('abcdef', 3)→'def'. - Supported by — MySQL, SQL Server, Postgres (via compatible functions), Snowflake.
- Use when — you need a fixed prefix or suffix, not a middle slice.
SPLIT_PART(s, delimiter, field_number) — Postgres / Snowflake.
-
SPLIT_PART('a,b,c', ',', 2)→'b'(1-based field index). -
MySQL equivalent —
SUBSTRING_INDEX(SUBSTRING_INDEX(s, ',', 2), ',', -1). -
SQL Server equivalent —
STRING_SPLIT(2016+) returns a table; useROW_NUMBER()to pick fieldn.
POSITION(substring IN string) / CHARINDEX / INSTR — find then slice.
-
Pattern —
SUBSTRING(s FROM POSITION('@' IN s) + 1)extracts everything after@. -
Postgres —
POSITION('@' IN email); SQL Server —CHARINDEX('@', email); MySQL —LOCATE('@', email). -
Safer with regex — when delimiter appears multiple times, use
SPLIT_PARTorREGEXP_REPLACEinstead.
Negative indexing (Postgres only).
-
SUBSTRING('abcdef' FROM -2)→'ef'(last two characters). -
Portable alternative —
RIGHT(s, 2).
Common slicing pitfalls.
-
0-based habit from Python —
SUBSTRING(s, 0, 3)in SQL often returns unexpected results or empty string. -
Unicode code points vs bytes —
SUBSTRINGoperates on characters in Postgres; byte slicing needsSUBSTRING(bytes FROM ... FOR ...). -
Empty string vs NULL —
SUBSTRING(NULL, 1, 3)→ NULL;SUBSTRING('', 1, 3)→''.
SQL
Topic — string-parsing
String parsing drills
SQL
Topic — string-manipulation
String manipulation SQL practice
5. REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns
sql replace handles literals; reach for regex when the pattern is variable
The replacement family swaps one substring for another. sql replace with the literal REPLACE() function handles fixed strings (REPLACE(phone, '-', '')). TRANSLATE() maps character-by-character (great for stripping punctuation). REGEXP_REPLACE() handles pattern-based swaps (normalize whitespace, mask digits).
REPLACE(string, old, new) — literal swap, all occurrences.
-
Signature —
REPLACE(s, 'old', 'new'); replaces every occurrence ofold. -
Case-sensitive —
'ABC'won't match'abc'; wrap withLOWER()if case-insensitive replacement is needed. -
Remove a substring —
REPLACE(s, '-', '')deletes every hyphen. - Supported everywhere — Postgres, MySQL, SQL Server, Oracle, Snowflake, BigQuery.
TRANSLATE(string, from, to) — character-by-character map.
-
Postgres / Oracle —
TRANSLATE('a1b2', '0123456789', '')strips all digits. -
SQL Server — no native
TRANSLATE; emulate with nestedREPLACEorREGEXP_REPLACE. - Use when — removing a set of characters (punctuation, digits) rather than a single literal.
REGEXP_REPLACE(string, pattern, replacement [, flags]) — pattern swap.
-
Postgres —
REGEXP_REPLACE(s, '\s+', ' ', 'g')collapses multiple spaces; flag'g'replaces all matches. -
MySQL —
REGEXP_REPLACE(s, '[0-9]', 'X')(MySQL 8.0+). -
SQL Server — no built-in until recent versions; use
PATINDEXloops or application-layer regex. - Use when — the pattern is variable (any digit, any whitespace run, email domain swap with regex).
REPLACE vs REGEXP_REPLACE — decision tree.
-
Known literal —
REPLACE(email, '@old.com', '@new.com'). -
Character class —
REGEXP_REPLACE(phone, '[^0-9]', '', 'g')keeps digits only. -
Performance — literal
REPLACEis faster; regex has compile overhead per row.
Chained normalization pattern (production pipelines).
SELECT REGEXP_REPLACE(
TRIM(
REPLACE(
REPLACE(raw_name, CHR(9), ' '), -- tabs to spaces
' ', ' ' -- double spaces (first pass)
)
),
'\s+', ' ', 'g' -- collapse remaining whitespace
) AS clean_name
FROM staging.users;
SQL
Topic — string-functions
String functions SQL drills
SQL
Topic — string-parsing
String parsing patterns
6. TRIM, LTRIM, RTRIM — cleaning dirty text columns
sql trim is the first function in every cleaning pipeline — know all three directions
Dirty text is the default in raw ingestion tables. sql trim removes leading and trailing characters — usually spaces, but you can specify any character set. Always TRIM before CONCAT, REPLACE, or regex so you don't propagate invisible padding.
ANSI TRIM([LEADING | TRAILING | BOTH] [chars] FROM string).
-
TRIM(BOTH ' ' FROM s)— strips spaces from both ends (default when you writeTRIM(s)in Postgres). -
TRIM(LEADING '0' FROM phone)— strips leading zeros only. -
TRIM(TRAILING '.' FROM url)— strips trailing dots only. -
Custom character set —
TRIM(BOTH ' #' FROM tag)removes spaces and#from both ends.
LTRIM(s) / RTRIM(s) — shorthand.
-
MySQL / SQL Server —
LTRIM(RTRIM(s))is the classic "trim both ends" idiom whenTRIM(BOTH ...)isn't available. -
Postgres —
BTRIM(s)is alias forTRIM(BOTH ' ' FROM s);LTRIM(s, chars)andRTRIM(s, chars)accept custom character sets.
TRIM vs REPLACE vs REGEXP_REPLACE for cleaning.
-
TRIM— only affects leading/trailing characters; safe for indexed columns if applied at ingest. -
REPLACE(s, ' ', '')— removes all spaces including internal ones; wrong for "John Smith". -
REGEXP_REPLACE(s, '^\s+|\s+$', '', 'g')— regex trim when custom whitespace (tabs, newlines) must go.
SQL Server TRIM() (2017+).
-
TRIM(s)— strips spaces from both ends; SQL Server 2017 added ANSI-styleTRIM. -
Legacy —
LTRIM(RTRIM(s))still appears in older codebases.
SQL interview question — normalize phone numbers to digits-only
Assume leads(lead_id, phone_raw) where phone_raw contains mixed formatting ((555) 123-4567, 555.123.4567, 5551234567). Return lead_id and a phone_digits column containing only the numeric characters, with no leading/trailing whitespace.
Solution Using TRIM + REGEXP_REPLACE
Code (PostgreSQL).
SELECT lead_id,
REGEXP_REPLACE(TRIM(phone_raw), '[^0-9]', '', 'g') AS phone_digits
FROM leads;
Equivalent in MySQL 8+.
SELECT lead_id,
REGEXP_REPLACE(TRIM(phone_raw), '[^0-9]', '') AS phone_digits
FROM leads;
Equivalent in SQL Server (nested REPLACE fallback).
SELECT lead_id,
REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
LTRIM(RTRIM(phone_raw)), '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''), '+', '') AS phone_digits
FROM leads;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | TRIM(phone_raw) |
removes leading/trailing spaces from CSV import |
| 2 | REGEXP_REPLACE(..., '[^0-9]', '', 'g') |
strips every non-digit character globally |
| 3 | result column phone_digits
|
uniform format for downstream joins |
Output:
| lead_id | phone_digits |
|---|---|
| 1 | 5551234567 |
| 2 | 5551234567 |
| 3 | 5551234567 |
Why this works — concept by concept:
- TRIM first — whitespace-only differences shouldn't survive into the digit extraction step.
-
Character-class regex —
[^0-9]matches any non-digit in one pass; cleaner than six nestedREPLACEcalls. - Global flag 'g' — without it, Postgres replaces only the first match; phone strings have multiple punctuation marks.
-
Idempotent output — already-clean rows (
5551234567) pass through unchanged. -
Cost —
O(n)scan; regex per row is acceptable for staging tables; for billion-row fact tables, normalize at ingest with a persisted column.
SQL
Topic — string-manipulation
String cleaning SQL drills
SQL
Topic — string-functions
String functions library
7. REGEXP, LIKE, and pattern matching — validation vs extraction
regexp in sql for structure; LIKE for simple prefix/suffix — pick the right tool
Pattern matching splits into two tiers. LIKE (and Postgres ILIKE) handles simple wildcard filters ('a%@%.com'). regexp in sql handles structural validation (email format, phone pattern, extract capture groups). Reviewers test whether you know when regex is overkill.
LIKE and ILIKE — wildcard matching.
-
LIKE 'abc%'— starts withabc. -
LIKE '%abc'— ends withabc. -
LIKE '%abc%'— containsabc. -
LIKE 'a_c'—_matches exactly one character. -
ILIKE '%ABC%'(Postgres) — case-insensitive contains. -
Cannot validate structure —
'a%@%.com'matches'a@@@@.com'; fine for coarse filters, wrong for validation.
Postgres POSIX regex operators.
-
s ~ 'pattern'— case-sensitive match. -
s ~* 'pattern'— case-insensitive match. -
s !~ 'pattern'— does not match. -
REGEXP_MATCH(s, pattern)— returns text array of capture groups (or NULL). -
REGEXP_REPLACE(s, pattern, replacement, 'g')— replace all matches.
MySQL REGEXP / REGEXP_LIKE (8.0+).
-
col REGEXP '^[0-9]+$'— MySQL regex syntax (not identical to Postgres POSIX). -
REGEXP_LIKE(col, pattern)— ANSI-style function form. -
REGEXP_SUBSTR(col, pattern)— extract matching substring (MySQL 8.0+).
SQL Server — LIKE, PATINDEX, limited regex.
-
LIKE '[0-9]%-%'— T-SQL character classes inLIKEpatterns. -
PATINDEX('%[0-9]%', col)— returns start position of first match (0 if none). -
No native
REGEXP_MATCH— use CLR, application layer, or Azure SQL external functions.
Validation vs filtering — interview decision table.
| Goal | Tool | Example |
|---|---|---|
Filter rows containing @
|
LIKE '%@%' |
coarse email filter |
| Validate email structure | ~ '^[^@]+@[^@]+\.[^@]+$' |
regex in WHERE |
| Extract domain from email |
REGEXP_MATCH or SUBSTRING ... FROM '@(.+)$'
|
extraction |
| Case-insensitive search |
ILIKE '%error%' (Postgres) |
log search |
| Flag invalid phone rows | NOT (phone ~ '^[0-9]{10}$') |
data quality check |
Common regex pitfalls.
-
Over-escaping — Postgres string literals need
'\\d'for\d; E-string'\\d'or'[0-9]'is clearer in interviews. - Catastrophic backtracking — avoid nested quantifiers on large text columns.
-
Index unfriendly — regex in
WHEREforces full scan; persist a validation flag column if queried repeatedly.
SQL
Topic — string-functions
REGEXP / LIKE SQL drills
SQL
Topic — string-parsing
String parsing SQL practice
8. Common string interview patterns — parsing, aggregation, validation
Eight patterns cover 80% of string-related interview questions
The remaining 20% are dialect-specific edge cases. Memorise these eight and you'll handle most prompts on the first try.
Pattern 1 — build a full name from parts.
-
TRIM(CONCAT(first_name, ' ', COALESCE(middle_name || ' ', ''), last_name))(Postgres with||). - Or
CONCAT_WS(' ', first_name, middle_name, last_name)(skips NULLs in MySQL/SQL Server style).
Pattern 2 — extract domain from email.
-
Postgres —
(REGEXP_MATCH(email, '@(.+)$'))[1]orSPLIT_PART(email, '@', 2). -
MySQL —
SUBSTRING_INDEX(email, '@', -1). -
SQL Server —
RIGHT(email, LEN(email) - CHARINDEX('@', email)).
Pattern 3 — split delimited string into rows.
-
Postgres —
unnest(string_to_array(tags, ',')). -
MySQL — recursive CTE or
JSON_TABLE(8.0+). -
SQL Server —
STRING_SPLIT(tags, ',')(2016+).
Pattern 4 — aggregate tags back into one string.
-
Postgres —
STRING_AGG(tag, ', ' ORDER BY tag). -
MySQL —
GROUP_CONCAT(tag ORDER BY tag SEPARATOR ', '). -
SQL Server —
STRING_AGG(tag, ', ') WITHIN GROUP (ORDER BY tag).
Pattern 5 — pad to fixed width.
-
Postgres —
LPAD(id::text, 6, '0')→'000042'. -
MySQL —
LPAD(id, 6, '0'). -
SQL Server —
RIGHT('000000' + CAST(id AS VARCHAR), 6).
Pattern 6 — case normalization for joins.
-
LOWER(TRIM(email))on both sides of the join — or persist aemail_normalizedcolumn at ingest.
Pattern 7 — find rows failing a validation rule.
SELECT *
FROM users
WHERE email !~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
Pattern 8 — mask sensitive data in a report.
-
Postgres —
REGEXP_REPLACE(ssn, '(\d{3})-(\d{2})-(\d{4})', '***-**-\3'). - Shows last four digits only; common compliance interview prompt.
SQL
Topic — string-parsing
String parsing drills
SQL
Topic — string-concatenation
String concatenation patterns
Choosing the right string function (cheat sheet)
A one-screen cheat sheet for using SQL string functions — pick the family that matches your intent, then the dialect-specific function.
| You want to … | Function family | Postgres / Standard | MySQL | SQL Server |
|---|---|---|---|---|
| Join strings (NULL-safe) | Concatenation | CONCAT(a, b, c) |
CONCAT(a, b, c) |
CONCAT(a, b, c) |
| Join with separator (skip NULLs) | Concatenation | CONCAT_WS(', ', a, b) |
CONCAT_WS(', ', a, b) |
CONCAT_WS(', ', a, b) |
| Join strings (NULL-poisoning) | Concatenation | `a \ | \ | b` |
| Extract middle of string | Slicing | SUBSTRING(s FROM 2 FOR 3) |
SUBSTRING(s, 2, 3) |
SUBSTRING(s, 2, 3) |
| First / last N characters | Slicing |
LEFT(s, n) / RIGHT(s, n)
|
LEFT(s, n) / RIGHT(s, n)
|
LEFT(s, n) / RIGHT(s, n)
|
| Split on delimiter (field N) | Slicing | SPLIT_PART(s, ',', 2) |
SUBSTRING_INDEX(...) |
STRING_SPLIT + row number |
| Swap literal substring | Replacement | REPLACE(s, 'old', 'new') |
REPLACE(s, 'old', 'new') |
REPLACE(s, 'old', 'new') |
| Strip character set | Replacement | TRANSLATE(s, '012', '') |
nested REPLACE
|
nested REPLACE
|
| Pattern-based swap | Replacement | REGEXP_REPLACE(s, pat, rep, 'g') |
REGEXP_REPLACE(s, pat, rep) |
nested REPLACE / CLR |
| Strip leading/trailing spaces | Cleaning |
TRIM(s) / BTRIM(s)
|
TRIM(s) |
TRIM(s) (2017+) |
| Strip one side only | Cleaning |
LTRIM(s) / RTRIM(s)
|
LTRIM(s) / RTRIM(s)
|
LTRIM(s) / RTRIM(s)
|
| Simple wildcard filter | Pattern | LIKE 'abc%' |
LIKE 'abc%' |
LIKE 'abc%' |
| Case-insensitive filter | Pattern | ILIKE '%abc%' |
LOWER(s) LIKE '%abc%' |
LOWER(s) LIKE '%abc%' |
| Regex match / validate | Pattern | s ~ '^pattern$' |
REGEXP_LIKE(s, pat) |
PATINDEX('%pat%', s) > 0 |
| Extract capture group | Pattern | (REGEXP_MATCH(s, pat))[1] |
REGEXP_SUBSTR(s, pat) |
application layer |
| Aggregate strings | Aggregation | STRING_AGG(col, ', ') |
GROUP_CONCAT(col) |
STRING_AGG(col, ', ') |
Frequently asked questions
What's the difference between CONCAT and || in PostgreSQL?
CONCAT(a, b, c) is variadic and NULL-safe — it treats NULL arguments as empty strings, so CONCAT('a', NULL, 'b') returns 'ab'. The || operator follows standard SQL NULL propagation — if any operand is NULL, the entire result is NULL. Use || when you want strict NULL semantics and will wrap nullable columns with COALESCE. Use CONCAT() or CONCAT_WS() when building display strings from nullable name parts. MySQL's CONCAT() returns NULL if any argument is NULL (different from Postgres); use CONCAT_WS() there to skip NULLs. SQL Server's CONCAT() (2012+) matches Postgres null-safe behavior.
Is SUBSTRING 0-based or 1-based in SQL?
1-based in every major dialect — SUBSTRING('abcdef', 1, 3) returns 'abc', and position 1 is the first character. This differs from Python, JavaScript, and most programming languages where indexing starts at 0. Postgres also supports negative start positions (SUBSTRING(s FROM -3) returns the last three characters). When extracting after a delimiter, compute POSITION('@' IN email) + 1 as the start — the + 1 skips the delimiter itself. Saying "start is 1-based" aloud in an interview prevents the most common off-by-one bug.
When should I use REPLACE vs REGEXP_REPLACE?
Use REPLACE(s, old, new) when swapping a known literal substring — domain rewrites, removing hyphens, fixing a consistent typo. It's faster and easier to read. Use REGEXP_REPLACE(s, pattern, replacement, 'g') when the pattern is variable — strip all non-digits ('[^0-9]'), collapse whitespace ('\s+'), mask SSN digits. In Postgres, remember the 'g' flag for global replacement; without it, only the first match is replaced. For character-set removal (all punctuation), Postgres TRANSLATE() is often cleaner than regex.
How do I trim custom characters, not just spaces?
ANSI syntax: TRIM(LEADING '0' FROM phone) strips leading zeros; TRIM(BOTH ' #' FROM tag) strips spaces and # from both ends. Postgres adds LTRIM(s, chars) and RTRIM(s, chars) with a custom character set argument. MySQL's TRIM() only removes spaces — for custom characters use TRIM(LEADING '0' FROM col) (MySQL 8.0 supports the full ANSI form) or nested REPLACE. SQL Server 2017+ supports TRIM() for spaces; for custom chars use TRIM(' #' FROM col) in compatible modes or LTRIM(RTRIM(REPLACE(s, '#', ''))) as a fallback.
What's the difference between REGEXP and LIKE in SQL?
LIKE uses % (any-length wildcard) and _ (single character) for simple prefix/suffix/contains filters — 'a%@%.com' matches strings starting with a and containing @ then .com, but cannot enforce full email structure. regexp in sql (Postgres ~, MySQL REGEXP_LIKE) matches against a full regular expression — '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' validates email shape. Use LIKE for coarse filtering on indexed columns where possible; use regex for validation flags, extraction (REGEXP_MATCH), and complex replacement. Postgres ILIKE adds case-insensitive LIKE; other dialects emulate with LOWER(col) LIKE LOWER(pattern).
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to string functions (CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP), string manipulation for cleaning and normalization, string parsing for delimiter splits and extraction, and the dialect quirks that fail candidates who memorise only one engine. Whether you're drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same five-family taxonomy this guide teaches.
Kick off via Explore practice →; drill the dedicated string-functions SQL lane →; fan out into the string-manipulation SQL lane →; rehearse string-parsing drills →; sharpen string-concatenation patterns →; widen coverage on the full SQL practice library →.






Top comments (0)