Gowtham Potureddi

Posted on May 22

SQL String Functions: CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP

#python #sql #interview #dataengineering

sql string functions split into five families that every data engineering pipeline eventually touches: concatenation (concat sql — CONCAT, ||, +), slicing (substring sql — SUBSTRING, SUBSTR, LEFT, RIGHT, SPLIT_PART), replacement (sql replace — REPLACE, TRANSLATE, REGEXP_REPLACE), cleaning (sql trim — TRIM, LTRIM, RTRIM, BTRIM), and pattern matching (regexp in sql — ~, REGEXP_LIKE, LIKE, ILIKE). Add delimiter parsing with STRING_AGG / GROUP_CONCAT / STRING_SPLIT and you can answer almost every text-normalization sql interview questions prompt in a panel — from "build a full name column" to "extract the domain from an email" to "flag rows that fail a regex validation rule."

This guide walks through every clause in the string-function family that reviewers test in data engineering interview questions: the VARCHAR / TEXT / CHAR type hierarchy and NULL propagation rules, the dialect jungle around concat sql (CONCAT null-safe vs || vs SQL Server +), substring sql indexing (1-based everywhere except a few legacy APIs), sql replace vs TRANSLATE vs REGEXP_REPLACE for normalization, sql trim variants (BOTH / LEADING / TRAILING and custom trim characters), and regexp in sql vs LIKE for validation and extraction — the patterns every sql for data engineers loop tests. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown.

When you want hands-on reps immediately after reading, browse string-functions SQL practice →, drill the string-manipulation lane →, sharpen string-parsing drills →, rehearse string-concatenation patterns →, or widen coverage on the full SQL practice library →.

On this page

Why SQL string functions matter in data engineering interviews
SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior
CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server
SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART
REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns
TRIM, LTRIM, RTRIM — cleaning dirty text columns
REGEXP, LIKE, and pattern matching — validation vs extraction
Common string interview patterns — parsing, aggregation, validation
Choosing the right string function (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why SQL string functions matter in data engineering interviews

The five string function families — that's the whole interview surface

The one-sentence invariant: sql string functions split into five families — concatenation (CONCAT, ||), slicing (SUBSTRING, LEFT, RIGHT), replacement (REPLACE, TRANSLATE, REGEXP_REPLACE), cleaning (TRIM, LTRIM, RTRIM), and pattern matching (regexp in sql, LIKE, ILIKE) — plus aggregation/parsing with STRING_AGG and SPLIT_PART. Once you internalise that taxonomy, every text-related interview prompt collapses to "pick the right function from the right family."

The five function families.

Concatenation — CONCAT(a, b, c), a || b (Postgres / Oracle), CONCAT(a, b) (MySQL / SQL Server 2012+), legacy a + b (SQL Server with NULL trap).
Slicing — SUBSTRING(s FROM start FOR len), SUBSTR(s, start, len), LEFT(s, n), RIGHT(s, n), SPLIT_PART(s, delim, n).
Replacement — REPLACE(s, old, new), TRANSLATE(s, from, to) (char-by-char map), REGEXP_REPLACE(s, pattern, replacement).
Cleaning — TRIM(BOTH ' ' FROM s), LTRIM(s), RTRIM(s), BTRIM(s, chars).
Pattern matching — s ~ 'regex' (Postgres), REGEXP_LIKE(s, pattern) (MySQL / Oracle), s LIKE '%pattern%', s ILIKE '%pattern%' (Postgres case-insensitive).

Why interviewers love string functions.

Raw data is messy — leading spaces, mixed casing, inconsistent delimiters, embedded newlines, phone numbers stored ten different ways.
Dialect divergence is real — concat sql null-handling differs across engines; substring sql start positions are 1-based in SQL but easy to confuse with 0-based languages.
Regex support varies — Postgres uses POSIX regex; MySQL has its own engine; SQL Server leans on LIKE + PATINDEX unless you enable CLR regex.
Performance traps are common — wrapping an indexed column in LOWER(email) defeats the index; functional indexes or normalized storage columns are the fix.

What interviewers listen for.

Do you name the family for the function you're picking? — "this is a cleaning step, so I'll reach for TRIM before REPLACE."
Do you mention NULL propagation when concatenating? — senior signal.
Do you use REGEXP_REPLACE only when literal REPLACE won't scale? — shows you know the cost trade-off.
Do you flag 1-based SUBSTRING indexing aloud? — avoids off-by-one bugs in live coding.

Worked example — a single query that uses all five families

Question. From contacts(contact_id, first_name, last_name, email_raw), return a trimmed full name, a lowercased email with @company.com replaced by @corp.com, and the email domain extracted via regex.

Input. contacts slice.

contact_id	first_name	last_name	email_raw
1	Alice	Smith	ALICE@COMPANY.COM
2	Bob	NULL	bob@test.org
3	Carol	Lee	carol@company.com

Code (PostgreSQL).

SELECT contact_id,
       TRIM(CONCAT(TRIM(first_name), ' ', TRIM(COALESCE(last_name, ''))))     AS full_name,           -- cleaning + concat
       LOWER(REPLACE(TRIM(email_raw), '@company.com', '@corp.com'))           AS normalized_email,    -- cleaning + replace
       SUBSTRING(TRIM(email_raw) FROM '@(.+)$')                               AS domain_via_substring, -- slicing (regex capture)
       (REGEXP_MATCH(TRIM(email_raw), '@(.+)$'))[1]                           AS domain_via_regexp     -- pattern matching
FROM contacts;

Step-by-step explanation.

TRIM(first_name) and TRIM(last_name) strip leading/trailing spaces (cleaning family).
CONCAT(..., ' ', ...) joins first and last name; COALESCE(last_name, '') prevents NULL from nulling the whole expression in || contexts.
REPLACE(TRIM(email_raw), '@company.com', '@corp.com') swaps the domain literal (replacement family).
SUBSTRING(... FROM '@(.+)$') extracts the substring matching the POSIX regex capture (slicing + pattern).
REGEXP_MATCH(...)[1] pulls capture group 1 as an array element (pattern matching family).

Output.

contact_id	full_name	normalized_email	domain_via_substring	domain_via_regexp
1	Alice Smith	alice@corp.com	company.com	company.com
2	Bob	bob@test.org	test.org	test.org
3	Carol Lee	carol@corp.com	company.com	company.com

Rule of thumb: identify the family first, then pick the dialect's function — a far more reliable mental model than memorising every function signature.

SQL
Topic — string-functions
String functions SQL drills

Practice →

SQL
Topic — string-manipulation
String manipulation SQL practice

Practice →

2. SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior

Pick the string type that matches the column — half the bugs are wrong-type bugs

Before any function call comes the question of which type the column should be. The string type family is small but the dialect names diverge — knowing when to use VARCHAR(n) vs TEXT vs fixed CHAR(n) is what separates candidates who've shipped production pipelines from candidates who've only read tutorials.

CHAR(n) — fixed-length, space-padded.

Stores — exactly n characters; shorter values are right-padded with spaces.
Use it for — legacy fixed-width codes (country codes, status flags) where length is invariant.
The gotcha — 'US' stored in CHAR(10) compares unequal to 'US' in VARCHAR because of trailing spaces; always TRIM before comparing.

VARCHAR(n) / NVARCHAR(n) — variable-length with optional cap.

Stores — up to n characters without padding.
MySQL — VARCHAR(255) is the classic indexed-column choice; longer values spill to off-page storage.
SQL Server — VARCHAR(n) for ASCII; NVARCHAR(n) for Unicode (UTF-16); VARCHAR(MAX) for large text.
Use it for — names, emails, SKUs, any text with a known reasonable max length.

TEXT / CLOB — unlimited (or very large) variable-length.

PostgreSQL — TEXT has no length limit and performs identically to VARCHAR without a length constraint; prefer TEXT unless you need a CHECK constraint on length.
MySQL — TEXT, MEDIUMTEXT, LONGTEXT tiers with different max sizes.
Use it for — JSON payloads, log lines, free-form notes — columns you won't index directly.

BYTEA / BLOB — binary, not text.

Not a string type — don't apply TRIM or CONCAT to binary columns; decode first with CONVERT_FROM(bytes, 'UTF8') (Postgres) or CAST(... AS VARCHAR).

NULL propagation — the universal concatenation trap.

CONCAT(a, b) (Postgres / SQL Server 2012+) — NULL-safe; treats NULL arguments as empty strings; CONCAT('a', NULL, 'b') → 'ab'.
CONCAT(a, b) (MySQL) — if any argument is NULL, result is NULL; use CONCAT_WS() to skip NULLs.
a || b (Postgres / Oracle) — NULL poisons the result; 'a' || NULL || 'b' → NULL.
SQL Server a + b — NULL poisons; use CONCAT() or CONCAT_WS() instead.
Fix in Postgres with || — COALESCE(a, '') || COALESCE(b, '') or use the built-in CONCAT() function (Postgres 9.1+).

Collation and case sensitivity.

Default collation determines whether 'ABC' = 'abc'; string comparisons in interviews usually assume case-sensitive unless stated.
Postgres ILIKE — case-insensitive pattern match; no ANSI equivalent.
Normalize early — store LOWER(email) in a generated column if you filter on it constantly.

Common type pitfalls.

Storing numbers as VARCHAR — sorts lexicographically ('10' < '2'); cast before arithmetic.
Comparing CHAR without TRIM — silent mismatches on padded values.
Applying string functions to NULL — most return NULL; use COALESCE at the boundary.

SQL
Topic — string-functions
String functions library

Practice →

SQL
Topic — string-manipulation
String manipulation drills

Practice →

3. CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server

`concat sql` is where NULL-handling diverges — know all three operators

The concat sql family is deceptively simple until NULL appears. PostgreSQL offers both CONCAT() (null-safe) and || (NULL-poisoning). MySQL's CONCAT() skips NULLs. SQL Server historically used + (NULL-poisoning) before CONCAT() arrived in 2012. Knowing all three is what separates candidates who've worked across engines.

PostgreSQL — CONCAT() vs ||.

CONCAT(a, b, c) — variadic, NULL-safe; skips NULL arguments.
a || b || c — ANSI concatenation operator; any NULL → entire result NULL.
CONCAT_WS(sep, a, b, c) — "concat with separator"; skips NULL parts and separator between them.
Example — CONCAT(first_name, ' ', last_name) vs first_name || ' ' || last_name (requires COALESCE on nullable columns).

MySQL — CONCAT() and CONCAT_WS().

CONCAT(str1, str2, ...) — if any argument is NULL, the entire result is NULL (stricter than Postgres CONCAT).
CONCAT_WS(sep, str1, str2, ...) — skips NULL arguments and their separators; the MySQL fix for nullable name parts.
GROUP_CONCAT(col SEPARATOR ',') — aggregate concatenation; the MySQL answer to STRING_AGG.

SQL Server — CONCAT(), CONCAT_WS(), legacy +.

CONCAT(a, b, c) — SQL Server 2012+; NULL-safe (treats NULL as empty string).
a + b — NULL-poisoning; legacy code still uses it.
CONCAT_WS(sep, a, b) — SQL Server 2017+; separator between non-NULL values.
STRING_AGG(col, ',') — aggregate concatenation (SQL Server 2017+).

Snowflake / BigQuery equivalents.

Snowflake — CONCAT(a, b) or a || b (both NULL-safe with IFNULL patterns); LISTAGG(col, ',') for aggregates.
BigQuery — CONCAT(a, b); NULL → NULL unless you wrap with IFNULL; STRING_AGG(col, ',') for aggregates.

When to reach for concatenation vs a template.

Simple join — CONCAT(first, ' ', last).
Conditional pieces — CONCAT_WS(' ', NULLIF(middle, ''), last) skips blank middle names cleanly.
Aggregate rollups — STRING_AGG(tag, ', ' ORDER BY tag) (Postgres) instead of concatenating in application code.

SQL interview question — build a display name with optional middle initial

Assume employees(emp_id, first_name, middle_name, last_name). Return emp_id and a display_name formatted as Last, First M. — omit the middle initial entirely when middle_name is NULL or empty; trim all parts first.

Solution Using CONCAT_WS + NULLIF + TRIM

Code (PostgreSQL).

SELECT emp_id,
       CONCAT_WS(', ',
                 TRIM(last_name),
                 CONCAT_WS(' ',
                           TRIM(first_name),
                           NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
                          )
                ) AS display_name
FROM employees;

Equivalent in MySQL.

SELECT emp_id,
       CONCAT_WS(', ',
                 TRIM(last_name),
                 CONCAT_WS(' ',
                           TRIM(first_name),
                           NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
                          )
                ) AS display_name
FROM employees;

Equivalent in SQL Server.

SELECT emp_id,
       CONCAT_WS(', ',
                 TRIM(last_name),
                 CONCAT_WS(' ',
                           TRIM(first_name),
                           NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
                          )
                ) AS display_name
FROM employees;

Step-by-step trace.

step	relation	outcome
1	`TRIM` each name part	strips padding spaces from `CHAR` columns
2	`LEFT(TRIM(middle_name), 1)`	first character of middle name
3	`CONCAT(..., '.')` then `NULLIF(..., '.')`	produces `'M.'` or NULL when middle is empty
4	inner `CONCAT_WS(' ', first, middle_initial)`	joins first name and optional initial; skips NULL middle
5	outer `CONCAT_WS(', ', last, first_block)`	produces `'Smith, Alice M.'` format

Output:

emp_id	display_name
1	Smith, Alice M.
2	Jones, Bob
3	Lee, Carol A.

Why this works — concept by concept:

TRIM before concat — fixed-width CHAR columns and CSV imports often carry trailing spaces that would break display formatting.
LEFT + CONCAT for initial — extracts one character and appends a period; simpler than SUBSTRING(middle_name FROM 1 FOR 1).
NULLIF(..., '.') — when middle_name is NULL or empty, LEFT returns empty string, CONCAT yields '.', and NULLIF converts that sentinel to NULL so CONCAT_WS skips it.
CONCAT_WS skips NULLs — no manual CASE WHEN middle_name IS NOT NULL wrapper needed; cleaner and dialect-portable.
Cost — single projection over the table; O(n) row scan with constant-time string ops per row.

SQL
Topic — string-concatenation
String concatenation drills

Practice →

SQL
Topic — string-functions
String functions SQL practice

Practice →

4. SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART

`substring sql` is 1-based everywhere — say that aloud in interviews

The slicing family extracts a portion of a string. Every major SQL dialect uses 1-based start positions (unlike Python's 0-based indexing). Negative start positions work in some dialects (Postgres SUBSTRING(s FROM -3) counts from the end). Off-by-one errors are the #1 bug in live coding — state "start is 1-based" before you write the query.

ANSI SUBSTRING(string FROM start [FOR length]).

Postgres / Oracle — SUBSTRING('abcdef' FROM 2 FOR 3) → 'bcd'.
MySQL — SUBSTRING('abcdef', 2, 3) or SUBSTR('abcdef', 2, 3) (function-call form).
SQL Server — SUBSTRING('abcdef', 2, 3) (same as MySQL).
Omit length — returns from start to end of string.

LEFT(s, n) and RIGHT(s, n) — convenience wrappers.

LEFT('abcdef', 3) → 'abc'; RIGHT('abcdef', 3) → 'def'.
Supported by — MySQL, SQL Server, Postgres (via compatible functions), Snowflake.
Use when — you need a fixed prefix or suffix, not a middle slice.

SPLIT_PART(s, delimiter, field_number) — Postgres / Snowflake.

SPLIT_PART('a,b,c', ',', 2) → 'b' (1-based field index).
MySQL equivalent — SUBSTRING_INDEX(SUBSTRING_INDEX(s, ',', 2), ',', -1).
SQL Server equivalent — STRING_SPLIT (2016+) returns a table; use ROW_NUMBER() to pick field n.

POSITION(substring IN string) / CHARINDEX / INSTR — find then slice.

Pattern — SUBSTRING(s FROM POSITION('@' IN s) + 1) extracts everything after @.
Postgres — POSITION('@' IN email); SQL Server — CHARINDEX('@', email); MySQL — LOCATE('@', email).
Safer with regex — when delimiter appears multiple times, use SPLIT_PART or REGEXP_REPLACE instead.

Negative indexing (Postgres only).

SUBSTRING('abcdef' FROM -2) → 'ef' (last two characters).
Portable alternative — RIGHT(s, 2).

Common slicing pitfalls.

0-based habit from Python — SUBSTRING(s, 0, 3) in SQL often returns unexpected results or empty string.
Unicode code points vs bytes — SUBSTRING operates on characters in Postgres; byte slicing needs SUBSTRING(bytes FROM ... FOR ...).
Empty string vs NULL — SUBSTRING(NULL, 1, 3) → NULL; SUBSTRING('', 1, 3) → ''.

SQL
Topic — string-parsing
String parsing drills

Practice →

SQL
Topic — string-manipulation
String manipulation SQL practice

Practice →

5. REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns

`sql replace` handles literals; reach for regex when the pattern is variable

The replacement family swaps one substring for another. sql replace with the literal REPLACE() function handles fixed strings (REPLACE(phone, '-', '')). TRANSLATE() maps character-by-character (great for stripping punctuation). REGEXP_REPLACE() handles pattern-based swaps (normalize whitespace, mask digits).

REPLACE(string, old, new) — literal swap, all occurrences.

Signature — REPLACE(s, 'old', 'new'); replaces every occurrence of old.
Case-sensitive — 'ABC' won't match 'abc'; wrap with LOWER() if case-insensitive replacement is needed.
Remove a substring — REPLACE(s, '-', '') deletes every hyphen.
Supported everywhere — Postgres, MySQL, SQL Server, Oracle, Snowflake, BigQuery.

TRANSLATE(string, from, to) — character-by-character map.

Postgres / Oracle — TRANSLATE('a1b2', '0123456789', '') strips all digits.
SQL Server — no native TRANSLATE; emulate with nested REPLACE or REGEXP_REPLACE.
Use when — removing a set of characters (punctuation, digits) rather than a single literal.

REGEXP_REPLACE(string, pattern, replacement [, flags]) — pattern swap.

Postgres — REGEXP_REPLACE(s, '\s+', ' ', 'g') collapses multiple spaces; flag 'g' replaces all matches.
MySQL — REGEXP_REPLACE(s, '[0-9]', 'X') (MySQL 8.0+).
SQL Server — no built-in until recent versions; use PATINDEX loops or application-layer regex.
Use when — the pattern is variable (any digit, any whitespace run, email domain swap with regex).

REPLACE vs REGEXP_REPLACE — decision tree.

Known literal — REPLACE(email, '@old.com', '@new.com').
Character class — REGEXP_REPLACE(phone, '[^0-9]', '', 'g') keeps digits only.
Performance — literal REPLACE is faster; regex has compile overhead per row.

Chained normalization pattern (production pipelines).

SELECT REGEXP_REPLACE(
         TRIM(
           REPLACE(
             REPLACE(raw_name, CHR(9), ' '),  -- tabs to spaces
             '  ', ' '                          -- double spaces (first pass)
           )
         ),
         '\s+', ' ', 'g'                        -- collapse remaining whitespace
       ) AS clean_name
FROM staging.users;

SQL
Topic — string-functions
String functions SQL drills

Practice →

SQL
Topic — string-parsing
String parsing patterns

Practice →

6. TRIM, LTRIM, RTRIM — cleaning dirty text columns

`sql trim` is the first function in every cleaning pipeline — know all three directions

Dirty text is the default in raw ingestion tables. sql trim removes leading and trailing characters — usually spaces, but you can specify any character set. Always TRIM before CONCAT, REPLACE, or regex so you don't propagate invisible padding.

ANSI TRIM([LEADING | TRAILING | BOTH] [chars] FROM string).

TRIM(BOTH ' ' FROM s) — strips spaces from both ends (default when you write TRIM(s) in Postgres).
TRIM(LEADING '0' FROM phone) — strips leading zeros only.
TRIM(TRAILING '.' FROM url) — strips trailing dots only.
Custom character set — TRIM(BOTH ' #' FROM tag) removes spaces and # from both ends.

LTRIM(s) / RTRIM(s) — shorthand.

MySQL / SQL Server — LTRIM(RTRIM(s)) is the classic "trim both ends" idiom when TRIM(BOTH ...) isn't available.
Postgres — BTRIM(s) is alias for TRIM(BOTH ' ' FROM s); LTRIM(s, chars) and RTRIM(s, chars) accept custom character sets.

TRIM vs REPLACE vs REGEXP_REPLACE for cleaning.

TRIM — only affects leading/trailing characters; safe for indexed columns if applied at ingest.
REPLACE(s, ' ', '') — removes all spaces including internal ones; wrong for "John Smith".
REGEXP_REPLACE(s, '^\s+|\s+$', '', 'g') — regex trim when custom whitespace (tabs, newlines) must go.

SQL Server TRIM() (2017+).

TRIM(s) — strips spaces from both ends; SQL Server 2017 added ANSI-style TRIM.
Legacy — LTRIM(RTRIM(s)) still appears in older codebases.

SQL interview question — normalize phone numbers to digits-only

Assume leads(lead_id, phone_raw) where phone_raw contains mixed formatting ((555) 123-4567, 555.123.4567, 5551234567). Return lead_id and a phone_digits column containing only the numeric characters, with no leading/trailing whitespace.

Solution Using TRIM + REGEXP_REPLACE

Code (PostgreSQL).

SELECT lead_id,
       REGEXP_REPLACE(TRIM(phone_raw), '[^0-9]', '', 'g') AS phone_digits
FROM leads;

Equivalent in MySQL 8+.

SELECT lead_id,
       REGEXP_REPLACE(TRIM(phone_raw), '[^0-9]', '') AS phone_digits
FROM leads;

Equivalent in SQL Server (nested REPLACE fallback).

SELECT lead_id,
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
         LTRIM(RTRIM(phone_raw)), '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''), '+', '') AS phone_digits
FROM leads;

Step-by-step trace.

step	relation	outcome
1	`TRIM(phone_raw)`	removes leading/trailing spaces from CSV import
2	`REGEXP_REPLACE(..., '[^0-9]', '', 'g')`	strips every non-digit character globally
3	result column `phone_digits`	uniform format for downstream joins

Output:

lead_id	phone_digits
1	5551234567
2	5551234567
3	5551234567

Why this works — concept by concept:

TRIM first — whitespace-only differences shouldn't survive into the digit extraction step.
Character-class regex — [^0-9] matches any non-digit in one pass; cleaner than six nested REPLACE calls.
Global flag 'g' — without it, Postgres replaces only the first match; phone strings have multiple punctuation marks.
Idempotent output — already-clean rows (5551234567) pass through unchanged.
Cost — O(n) scan; regex per row is acceptable for staging tables; for billion-row fact tables, normalize at ingest with a persisted column.

SQL
Topic — string-manipulation
String cleaning SQL drills

Practice →

SQL
Topic — string-functions
String functions library

Practice →

7. REGEXP, LIKE, and pattern matching — validation vs extraction

`regexp in sql` for structure; `LIKE` for simple prefix/suffix — pick the right tool

Pattern matching splits into two tiers. LIKE (and Postgres ILIKE) handles simple wildcard filters ('a%@%.com'). regexp in sql handles structural validation (email format, phone pattern, extract capture groups). Reviewers test whether you know when regex is overkill.

LIKE and ILIKE — wildcard matching.

LIKE 'abc%' — starts with abc.
LIKE '%abc' — ends with abc.
LIKE '%abc%' — contains abc.
LIKE 'a_c' — _ matches exactly one character.
ILIKE '%ABC%' (Postgres) — case-insensitive contains.
Cannot validate structure — 'a%@%.com' matches 'a@@@@.com'; fine for coarse filters, wrong for validation.

Postgres POSIX regex operators.

s ~ 'pattern' — case-sensitive match.
s ~* 'pattern' — case-insensitive match.
s !~ 'pattern' — does not match.
REGEXP_MATCH(s, pattern) — returns text array of capture groups (or NULL).
REGEXP_REPLACE(s, pattern, replacement, 'g') — replace all matches.

MySQL REGEXP / REGEXP_LIKE (8.0+).

col REGEXP '^[0-9]+$' — MySQL regex syntax (not identical to Postgres POSIX).
REGEXP_LIKE(col, pattern) — ANSI-style function form.
REGEXP_SUBSTR(col, pattern) — extract matching substring (MySQL 8.0+).

SQL Server — LIKE, PATINDEX, limited regex.

LIKE '[0-9]%-%' — T-SQL character classes in LIKE patterns.
PATINDEX('%[0-9]%', col) — returns start position of first match (0 if none).
No native REGEXP_MATCH — use CLR, application layer, or Azure SQL external functions.

Validation vs filtering — interview decision table.

Goal	Tool	Example
Filter rows containing `@`	`LIKE '%@%'`	coarse email filter
Validate email structure	`~ '^[^@]+@[^@]+\.[^@]+$'`	regex in WHERE
Extract domain from email	`REGEXP_MATCH` or `SUBSTRING ... FROM '@(.+)$'`	extraction
Case-insensitive search	`ILIKE '%error%'` (Postgres)	log search
Flag invalid phone rows	`NOT (phone ~ '^[0-9]{10}$')`	data quality check

Common regex pitfalls.

Over-escaping — Postgres string literals need '\\d' for \d; E-string '\\d' or '[0-9]' is clearer in interviews.
Catastrophic backtracking — avoid nested quantifiers on large text columns.
Index unfriendly — regex in WHERE forces full scan; persist a validation flag column if queried repeatedly.

SQL
Topic — string-functions
REGEXP / LIKE SQL drills

Practice →

SQL
Topic — string-parsing
String parsing SQL practice

Practice →

8. Common string interview patterns — parsing, aggregation, validation

Eight patterns cover 80% of string-related interview questions

The remaining 20% are dialect-specific edge cases. Memorise these eight and you'll handle most prompts on the first try.

Pattern 1 — build a full name from parts.

TRIM(CONCAT(first_name, ' ', COALESCE(middle_name || ' ', ''), last_name)) (Postgres with ||).
Or CONCAT_WS(' ', first_name, middle_name, last_name) (skips NULLs in MySQL/SQL Server style).

Pattern 2 — extract domain from email.

Postgres — (REGEXP_MATCH(email, '@(.+)$'))[1] or SPLIT_PART(email, '@', 2).
MySQL — SUBSTRING_INDEX(email, '@', -1).
SQL Server — RIGHT(email, LEN(email) - CHARINDEX('@', email)).

Pattern 3 — split delimited string into rows.

Postgres — unnest(string_to_array(tags, ',')).
MySQL — recursive CTE or JSON_TABLE (8.0+).
SQL Server — STRING_SPLIT(tags, ',') (2016+).

Pattern 4 — aggregate tags back into one string.

Postgres — STRING_AGG(tag, ', ' ORDER BY tag).
MySQL — GROUP_CONCAT(tag ORDER BY tag SEPARATOR ', ').
SQL Server — STRING_AGG(tag, ', ') WITHIN GROUP (ORDER BY tag).

Pattern 5 — pad to fixed width.

Postgres — LPAD(id::text, 6, '0') → '000042'.
MySQL — LPAD(id, 6, '0').
SQL Server — RIGHT('000000' + CAST(id AS VARCHAR), 6).

Pattern 6 — case normalization for joins.

LOWER(TRIM(email)) on both sides of the join — or persist a email_normalized column at ingest.

Pattern 7 — find rows failing a validation rule.

SELECT *
FROM users
WHERE email !~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

Pattern 8 — mask sensitive data in a report.

Postgres — REGEXP_REPLACE(ssn, '(\d{3})-(\d{2})-(\d{4})', '***-**-\3').
Shows last four digits only; common compliance interview prompt.

SQL
Topic — string-parsing
String parsing drills

Practice →

SQL
Topic — string-concatenation
String concatenation patterns

Practice →

Choosing the right string function (cheat sheet)

A one-screen cheat sheet for using SQL string functions — pick the family that matches your intent, then the dialect-specific function.

You want to …	Function family	Postgres / Standard	MySQL	SQL Server
Join strings (NULL-safe)	Concatenation	`CONCAT(a, b, c)`	`CONCAT(a, b, c)`	`CONCAT(a, b, c)`
Join with separator (skip NULLs)	Concatenation	`CONCAT_WS(', ', a, b)`	`CONCAT_WS(', ', a, b)`	`CONCAT_WS(', ', a, b)`
Join strings (NULL-poisoning)	Concatenation	`a \	\	b`
Extract middle of string	Slicing	`SUBSTRING(s FROM 2 FOR 3)`	`SUBSTRING(s, 2, 3)`	`SUBSTRING(s, 2, 3)`
First / last N characters	Slicing	`LEFT(s, n)` / `RIGHT(s, n)`	`LEFT(s, n)` / `RIGHT(s, n)`	`LEFT(s, n)` / `RIGHT(s, n)`
Split on delimiter (field N)	Slicing	`SPLIT_PART(s, ',', 2)`	`SUBSTRING_INDEX(...)`	`STRING_SPLIT` + row number
Swap literal substring	Replacement	`REPLACE(s, 'old', 'new')`	`REPLACE(s, 'old', 'new')`	`REPLACE(s, 'old', 'new')`
Strip character set	Replacement	`TRANSLATE(s, '012', '')`	nested `REPLACE`	nested `REPLACE`
Pattern-based swap	Replacement	`REGEXP_REPLACE(s, pat, rep, 'g')`	`REGEXP_REPLACE(s, pat, rep)`	nested `REPLACE` / CLR
Strip leading/trailing spaces	Cleaning	`TRIM(s)` / `BTRIM(s)`	`TRIM(s)`	`TRIM(s)` (2017+)
Strip one side only	Cleaning	`LTRIM(s)` / `RTRIM(s)`	`LTRIM(s)` / `RTRIM(s)`	`LTRIM(s)` / `RTRIM(s)`
Simple wildcard filter	Pattern	`LIKE 'abc%'`	`LIKE 'abc%'`	`LIKE 'abc%'`
Case-insensitive filter	Pattern	`ILIKE '%abc%'`	`LOWER(s) LIKE '%abc%'`	`LOWER(s) LIKE '%abc%'`
Regex match / validate	Pattern	`s ~ '^pattern$'`	`REGEXP_LIKE(s, pat)`	`PATINDEX('%pat%', s) > 0`
Extract capture group	Pattern	`(REGEXP_MATCH(s, pat))[1]`	`REGEXP_SUBSTR(s, pat)`	application layer
Aggregate strings	Aggregation	`STRING_AGG(col, ', ')`	`GROUP_CONCAT(col)`	`STRING_AGG(col, ', ')`

Frequently asked questions

What's the difference between CONCAT and || in PostgreSQL?

CONCAT(a, b, c) is variadic and NULL-safe — it treats NULL arguments as empty strings, so CONCAT('a', NULL, 'b') returns 'ab'. The || operator follows standard SQL NULL propagation — if any operand is NULL, the entire result is NULL. Use || when you want strict NULL semantics and will wrap nullable columns with COALESCE. Use CONCAT() or CONCAT_WS() when building display strings from nullable name parts. MySQL's CONCAT() returns NULL if any argument is NULL (different from Postgres); use CONCAT_WS() there to skip NULLs. SQL Server's CONCAT() (2012+) matches Postgres null-safe behavior.

Is SUBSTRING 0-based or 1-based in SQL?

1-based in every major dialect — SUBSTRING('abcdef', 1, 3) returns 'abc', and position 1 is the first character. This differs from Python, JavaScript, and most programming languages where indexing starts at 0. Postgres also supports negative start positions (SUBSTRING(s FROM -3) returns the last three characters). When extracting after a delimiter, compute POSITION('@' IN email) + 1 as the start — the + 1 skips the delimiter itself. Saying "start is 1-based" aloud in an interview prevents the most common off-by-one bug.

When should I use REPLACE vs REGEXP_REPLACE?

Use REPLACE(s, old, new) when swapping a known literal substring — domain rewrites, removing hyphens, fixing a consistent typo. It's faster and easier to read. Use REGEXP_REPLACE(s, pattern, replacement, 'g') when the pattern is variable — strip all non-digits ('[^0-9]'), collapse whitespace ('\s+'), mask SSN digits. In Postgres, remember the 'g' flag for global replacement; without it, only the first match is replaced. For character-set removal (all punctuation), Postgres TRANSLATE() is often cleaner than regex.

How do I trim custom characters, not just spaces?

ANSI syntax: TRIM(LEADING '0' FROM phone) strips leading zeros; TRIM(BOTH ' #' FROM tag) strips spaces and # from both ends. Postgres adds LTRIM(s, chars) and RTRIM(s, chars) with a custom character set argument. MySQL's TRIM() only removes spaces — for custom characters use TRIM(LEADING '0' FROM col) (MySQL 8.0 supports the full ANSI form) or nested REPLACE. SQL Server 2017+ supports TRIM() for spaces; for custom chars use TRIM(' #' FROM col) in compatible modes or LTRIM(RTRIM(REPLACE(s, '#', ''))) as a fallback.

What's the difference between REGEXP and LIKE in SQL?

LIKE uses % (any-length wildcard) and _ (single character) for simple prefix/suffix/contains filters — 'a%@%.com' matches strings starting with a and containing @ then .com, but cannot enforce full email structure. regexp in sql (Postgres ~, MySQL REGEXP_LIKE) matches against a full regular expression — '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' validates email shape. Use LIKE for coarse filtering on indexed columns where possible; use regex for validation flags, extraction (REGEXP_MATCH), and complex replacement. Postgres ILIKE adds case-insensitive LIKE; other dialects emulate with LOWER(col) LIKE LOWER(pattern).

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to string functions (CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP), string manipulation for cleaning and normalization, string parsing for delimiter splits and extraction, and the dialect quirks that fail candidates who memorise only one engine. Whether you're drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same five-family taxonomy this guide teaches.

Kick off via Explore practice →; drill the dedicated string-functions SQL lane →; fan out into the string-manipulation SQL lane →; rehearse string-parsing drills →; sharpen string-concatenation patterns →; widen coverage on the full SQL practice library →.

DEV Community

SQL String Functions: CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP

1. Why SQL string functions matter in data engineering interviews

The five string function families — that's the whole interview surface

Worked example — a single query that uses all five families

2. SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior

Pick the string type that matches the column — half the bugs are wrong-type bugs

3. CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server

`concat sql` is where NULL-handling diverges — know all three operators

SQL interview question — build a display name with optional middle initial

Solution Using CONCAT_WS + NULLIF + TRIM

4. SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART

`substring sql` is 1-based everywhere — say that aloud in interviews

5. REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns

`sql replace` handles literals; reach for regex when the pattern is variable

6. TRIM, LTRIM, RTRIM — cleaning dirty text columns

`sql trim` is the first function in every cleaning pipeline — know all three directions

SQL interview question — normalize phone numbers to digits-only

Solution Using TRIM + REGEXP_REPLACE

7. REGEXP, LIKE, and pattern matching — validation vs extraction

`regexp in sql` for structure; `LIKE` for simple prefix/suffix — pick the right tool

8. Common string interview patterns — parsing, aggregation, validation

Eight patterns cover 80% of string-related interview questions

Choosing the right string function (cheat sheet)

Frequently asked questions

What's the difference between CONCAT and || in PostgreSQL?

Is SUBSTRING 0-based or 1-based in SQL?

When should I use REPLACE vs REGEXP_REPLACE?

How do I trim custom characters, not just spaces?

What's the difference between REGEXP and LIKE in SQL?

Practice on PipeCode

Top comments (0)

1. Why SQL string functions matter in data engineering interviews

The five string function families — that's the whole interview surface

Worked example — a single query that uses all five families

2. SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior

Pick the string type that matches the column — half the bugs are wrong-type bugs

3. CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server

concat sql is where NULL-handling diverges — know all three operators

SQL interview question — build a display name with optional middle initial

Solution Using CONCAT_WS + NULLIF + TRIM

4. SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART

substring sql is 1-based everywhere — say that aloud in interviews

5. REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns

sql replace handles literals; reach for regex when the pattern is variable

6. TRIM, LTRIM, RTRIM — cleaning dirty text columns

sql trim is the first function in every cleaning pipeline — know all three directions

SQL interview question — normalize phone numbers to digits-only

Solution Using TRIM + REGEXP_REPLACE

7. REGEXP, LIKE, and pattern matching — validation vs extraction

regexp in sql for structure; LIKE for simple prefix/suffix — pick the right tool

8. Common string interview patterns — parsing, aggregation, validation

Eight patterns cover 80% of string-related interview questions

Choosing the right string function (cheat sheet)

Frequently asked questions

What's the difference between CONCAT and || in PostgreSQL?

Is SUBSTRING 0-based or 1-based in SQL?

When should I use REPLACE vs REGEXP_REPLACE?

How do I trim custom characters, not just spaces?

What's the difference between REGEXP and LIKE in SQL?

Practice on PipeCode

`concat sql` is where NULL-handling diverges — know all three operators

`substring sql` is 1-based everywhere — say that aloud in interviews

`sql replace` handles literals; reach for regex when the pattern is variable

`sql trim` is the first function in every cleaning pipeline — know all three directions

`regexp in sql` for structure; `LIKE` for simple prefix/suffix — pick the right tool