DEV Community

Cover image for SQL String Functions: CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP
Gowtham Potureddi
Gowtham Potureddi

Posted on

SQL String Functions: CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP

sql string functions split into five families that every data engineering pipeline eventually touches: concatenation (concat sqlCONCAT, ||, +), slicing (substring sqlSUBSTRING, SUBSTR, LEFT, RIGHT, SPLIT_PART), replacement (sql replaceREPLACE, TRANSLATE, REGEXP_REPLACE), cleaning (sql trimTRIM, LTRIM, RTRIM, BTRIM), and pattern matching (regexp in sql~, REGEXP_LIKE, LIKE, ILIKE). Add delimiter parsing with STRING_AGG / GROUP_CONCAT / STRING_SPLIT and you can answer almost every text-normalization sql interview questions prompt in a panel — from "build a full name column" to "extract the domain from an email" to "flag rows that fail a regex validation rule."

This guide walks through every clause in the string-function family that reviewers test in data engineering interview questions: the VARCHAR / TEXT / CHAR type hierarchy and NULL propagation rules, the dialect jungle around concat sql (CONCAT null-safe vs || vs SQL Server +), substring sql indexing (1-based everywhere except a few legacy APIs), sql replace vs TRANSLATE vs REGEXP_REPLACE for normalization, sql trim variants (BOTH / LEADING / TRAILING and custom trim characters), and regexp in sql vs LIKE for validation and extraction — the patterns every sql for data engineers loop tests. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown.

PipeCode blog header for a SQL string functions tutorial — bold white headline 'SQL String Functions' with subtitle 'CONCAT · SUBSTRING · REPLACE · TRIM · REGEXP' and a minimal code snippet on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse string-functions SQL practice →, drill the string-manipulation lane →, sharpen string-parsing drills →, rehearse string-concatenation patterns →, or widen coverage on the full SQL practice library →.


On this page


1. Why SQL string functions matter in data engineering interviews

The five string function families — that's the whole interview surface

The one-sentence invariant: sql string functions split into five families — concatenation (CONCAT, ||), slicing (SUBSTRING, LEFT, RIGHT), replacement (REPLACE, TRANSLATE, REGEXP_REPLACE), cleaning (TRIM, LTRIM, RTRIM), and pattern matching (regexp in sql, LIKE, ILIKE) — plus aggregation/parsing with STRING_AGG and SPLIT_PART. Once you internalise that taxonomy, every text-related interview prompt collapses to "pick the right function from the right family."

The five function families.

  • ConcatenationCONCAT(a, b, c), a || b (Postgres / Oracle), CONCAT(a, b) (MySQL / SQL Server 2012+), legacy a + b (SQL Server with NULL trap).
  • SlicingSUBSTRING(s FROM start FOR len), SUBSTR(s, start, len), LEFT(s, n), RIGHT(s, n), SPLIT_PART(s, delim, n).
  • ReplacementREPLACE(s, old, new), TRANSLATE(s, from, to) (char-by-char map), REGEXP_REPLACE(s, pattern, replacement).
  • CleaningTRIM(BOTH ' ' FROM s), LTRIM(s), RTRIM(s), BTRIM(s, chars).
  • Pattern matchings ~ 'regex' (Postgres), REGEXP_LIKE(s, pattern) (MySQL / Oracle), s LIKE '%pattern%', s ILIKE '%pattern%' (Postgres case-insensitive).

Why interviewers love string functions.

  • Raw data is messy — leading spaces, mixed casing, inconsistent delimiters, embedded newlines, phone numbers stored ten different ways.
  • Dialect divergence is realconcat sql null-handling differs across engines; substring sql start positions are 1-based in SQL but easy to confuse with 0-based languages.
  • Regex support varies — Postgres uses POSIX regex; MySQL has its own engine; SQL Server leans on LIKE + PATINDEX unless you enable CLR regex.
  • Performance traps are common — wrapping an indexed column in LOWER(email) defeats the index; functional indexes or normalized storage columns are the fix.

What interviewers listen for.

  • Do you name the family for the function you're picking? — "this is a cleaning step, so I'll reach for TRIM before REPLACE."
  • Do you mention NULL propagation when concatenating? — senior signal.
  • Do you use REGEXP_REPLACE only when literal REPLACE won't scale? — shows you know the cost trade-off.
  • Do you flag 1-based SUBSTRING indexing aloud? — avoids off-by-one bugs in live coding.

Worked example — a single query that uses all five families

Question. From contacts(contact_id, first_name, last_name, email_raw), return a trimmed full name, a lowercased email with @company.com replaced by @corp.com, and the email domain extracted via regex.

Input. contacts slice.

contact_id first_name last_name email_raw
1 Alice Smith ALICE@COMPANY.COM
2 Bob NULL bob@test.org
3 Carol Lee carol@company.com

Code (PostgreSQL).

SELECT contact_id,
       TRIM(CONCAT(TRIM(first_name), ' ', TRIM(COALESCE(last_name, ''))))     AS full_name,           -- cleaning + concat
       LOWER(REPLACE(TRIM(email_raw), '@company.com', '@corp.com'))           AS normalized_email,    -- cleaning + replace
       SUBSTRING(TRIM(email_raw) FROM '@(.+)$')                               AS domain_via_substring, -- slicing (regex capture)
       (REGEXP_MATCH(TRIM(email_raw), '@(.+)$'))[1]                           AS domain_via_regexp     -- pattern matching
FROM contacts;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. TRIM(first_name) and TRIM(last_name) strip leading/trailing spaces (cleaning family).
  2. CONCAT(..., ' ', ...) joins first and last name; COALESCE(last_name, '') prevents NULL from nulling the whole expression in || contexts.
  3. REPLACE(TRIM(email_raw), '@company.com', '@corp.com') swaps the domain literal (replacement family).
  4. SUBSTRING(... FROM '@(.+)$') extracts the substring matching the POSIX regex capture (slicing + pattern).
  5. REGEXP_MATCH(...)[1] pulls capture group 1 as an array element (pattern matching family).

Output.

contact_id full_name normalized_email domain_via_substring domain_via_regexp
1 Alice Smith alice@corp.com company.com company.com
2 Bob bob@test.org test.org test.org
3 Carol Lee carol@corp.com company.com company.com

Rule of thumb: identify the family first, then pick the dialect's function — a far more reliable mental model than memorising every function signature.

SQL
Topic — string-functions
String functions SQL drills

Practice →

SQL
Topic — string-manipulation
String manipulation SQL practice

Practice →


2. SQL string types — VARCHAR, TEXT, CHAR, and NULL behavior

Diagram of the SQL string type hierarchy — four labelled cards in a row (CHAR, VARCHAR, TEXT, BYTEA/BLOB) with sample values and a one-line description each, on a light PipeCode card.

Pick the string type that matches the column — half the bugs are wrong-type bugs

Before any function call comes the question of which type the column should be. The string type family is small but the dialect names diverge — knowing when to use VARCHAR(n) vs TEXT vs fixed CHAR(n) is what separates candidates who've shipped production pipelines from candidates who've only read tutorials.

CHAR(n) — fixed-length, space-padded.

  • Stores — exactly n characters; shorter values are right-padded with spaces.
  • Use it for — legacy fixed-width codes (country codes, status flags) where length is invariant.
  • The gotcha'US' stored in CHAR(10) compares unequal to 'US' in VARCHAR because of trailing spaces; always TRIM before comparing.

VARCHAR(n) / NVARCHAR(n) — variable-length with optional cap.

  • Stores — up to n characters without padding.
  • MySQLVARCHAR(255) is the classic indexed-column choice; longer values spill to off-page storage.
  • SQL ServerVARCHAR(n) for ASCII; NVARCHAR(n) for Unicode (UTF-16); VARCHAR(MAX) for large text.
  • Use it for — names, emails, SKUs, any text with a known reasonable max length.

TEXT / CLOB — unlimited (or very large) variable-length.

  • PostgreSQLTEXT has no length limit and performs identically to VARCHAR without a length constraint; prefer TEXT unless you need a CHECK constraint on length.
  • MySQLTEXT, MEDIUMTEXT, LONGTEXT tiers with different max sizes.
  • Use it for — JSON payloads, log lines, free-form notes — columns you won't index directly.

BYTEA / BLOB — binary, not text.

  • Not a string type — don't apply TRIM or CONCAT to binary columns; decode first with CONVERT_FROM(bytes, 'UTF8') (Postgres) or CAST(... AS VARCHAR).

NULL propagation — the universal concatenation trap.

  • CONCAT(a, b) (Postgres / SQL Server 2012+) — NULL-safe; treats NULL arguments as empty strings; CONCAT('a', NULL, 'b')'ab'.
  • CONCAT(a, b) (MySQL) — if any argument is NULL, result is NULL; use CONCAT_WS() to skip NULLs.
  • a || b (Postgres / Oracle) — NULL poisons the result; 'a' || NULL || 'b' → NULL.
  • SQL Server a + b — NULL poisons; use CONCAT() or CONCAT_WS() instead.
  • Fix in Postgres with ||COALESCE(a, '') || COALESCE(b, '') or use the built-in CONCAT() function (Postgres 9.1+).

Collation and case sensitivity.

  • Default collation determines whether 'ABC' = 'abc'; string comparisons in interviews usually assume case-sensitive unless stated.
  • Postgres ILIKE — case-insensitive pattern match; no ANSI equivalent.
  • Normalize early — store LOWER(email) in a generated column if you filter on it constantly.

Common type pitfalls.

  • Storing numbers as VARCHAR — sorts lexicographically ('10' < '2'); cast before arithmetic.
  • Comparing CHAR without TRIM — silent mismatches on padded values.
  • Applying string functions to NULL — most return NULL; use COALESCE at the boundary.

SQL
Topic — string-functions
String functions library

Practice →

SQL
Topic — string-manipulation
String manipulation drills

Practice →


3. CONCAT and concatenation — across PostgreSQL, MySQL, SQL Server

Diagram of CONCAT signatures across three dialects — PostgreSQL, MySQL, and SQL Server — shown as three side-by-side code cards illustrating CONCAT vs || vs + and NULL-handling differences, on a light PipeCode card.

concat sql is where NULL-handling diverges — know all three operators

The concat sql family is deceptively simple until NULL appears. PostgreSQL offers both CONCAT() (null-safe) and || (NULL-poisoning). MySQL's CONCAT() skips NULLs. SQL Server historically used + (NULL-poisoning) before CONCAT() arrived in 2012. Knowing all three is what separates candidates who've worked across engines.

PostgreSQL — CONCAT() vs ||.

  • CONCAT(a, b, c) — variadic, NULL-safe; skips NULL arguments.
  • a || b || c — ANSI concatenation operator; any NULL → entire result NULL.
  • CONCAT_WS(sep, a, b, c) — "concat with separator"; skips NULL parts and separator between them.
  • ExampleCONCAT(first_name, ' ', last_name) vs first_name || ' ' || last_name (requires COALESCE on nullable columns).

MySQL — CONCAT() and CONCAT_WS().

  • CONCAT(str1, str2, ...) — if any argument is NULL, the entire result is NULL (stricter than Postgres CONCAT).
  • CONCAT_WS(sep, str1, str2, ...) — skips NULL arguments and their separators; the MySQL fix for nullable name parts.
  • GROUP_CONCAT(col SEPARATOR ',') — aggregate concatenation; the MySQL answer to STRING_AGG.

SQL Server — CONCAT(), CONCAT_WS(), legacy +.

  • CONCAT(a, b, c) — SQL Server 2012+; NULL-safe (treats NULL as empty string).
  • a + b — NULL-poisoning; legacy code still uses it.
  • CONCAT_WS(sep, a, b) — SQL Server 2017+; separator between non-NULL values.
  • STRING_AGG(col, ',') — aggregate concatenation (SQL Server 2017+).

Snowflake / BigQuery equivalents.

  • SnowflakeCONCAT(a, b) or a || b (both NULL-safe with IFNULL patterns); LISTAGG(col, ',') for aggregates.
  • BigQueryCONCAT(a, b); NULL → NULL unless you wrap with IFNULL; STRING_AGG(col, ',') for aggregates.

When to reach for concatenation vs a template.

  • Simple joinCONCAT(first, ' ', last).
  • Conditional piecesCONCAT_WS(' ', NULLIF(middle, ''), last) skips blank middle names cleanly.
  • Aggregate rollupsSTRING_AGG(tag, ', ' ORDER BY tag) (Postgres) instead of concatenating in application code.

SQL interview question — build a display name with optional middle initial

Assume employees(emp_id, first_name, middle_name, last_name). Return emp_id and a display_name formatted as Last, First M. — omit the middle initial entirely when middle_name is NULL or empty; trim all parts first.

Solution Using CONCAT_WS + NULLIF + TRIM

Code (PostgreSQL).

SELECT emp_id,
       CONCAT_WS(', ',
                 TRIM(last_name),
                 CONCAT_WS(' ',
                           TRIM(first_name),
                           NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
                          )
                ) AS display_name
FROM employees;
Enter fullscreen mode Exit fullscreen mode

Equivalent in MySQL.

SELECT emp_id,
       CONCAT_WS(', ',
                 TRIM(last_name),
                 CONCAT_WS(' ',
                           TRIM(first_name),
                           NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
                          )
                ) AS display_name
FROM employees;
Enter fullscreen mode Exit fullscreen mode

Equivalent in SQL Server.

SELECT emp_id,
       CONCAT_WS(', ',
                 TRIM(last_name),
                 CONCAT_WS(' ',
                           TRIM(first_name),
                           NULLIF(CONCAT(LEFT(TRIM(middle_name), 1), '.'), '.')
                          )
                ) AS display_name
FROM employees;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 TRIM each name part strips padding spaces from CHAR columns
2 LEFT(TRIM(middle_name), 1) first character of middle name
3 CONCAT(..., '.') then NULLIF(..., '.') produces 'M.' or NULL when middle is empty
4 inner CONCAT_WS(' ', first, middle_initial) joins first name and optional initial; skips NULL middle
5 outer CONCAT_WS(', ', last, first_block) produces 'Smith, Alice M.' format

Output:

emp_id display_name
1 Smith, Alice M.
2 Jones, Bob
3 Lee, Carol A.

Why this works — concept by concept:

  • TRIM before concat — fixed-width CHAR columns and CSV imports often carry trailing spaces that would break display formatting.
  • LEFT + CONCAT for initial — extracts one character and appends a period; simpler than SUBSTRING(middle_name FROM 1 FOR 1).
  • NULLIF(..., '.') — when middle_name is NULL or empty, LEFT returns empty string, CONCAT yields '.', and NULLIF converts that sentinel to NULL so CONCAT_WS skips it.
  • CONCAT_WS skips NULLs — no manual CASE WHEN middle_name IS NOT NULL wrapper needed; cleaner and dialect-portable.
  • Cost — single projection over the table; O(n) row scan with constant-time string ops per row.

SQL
Topic — string-concatenation
String concatenation drills

Practice →

SQL
Topic — string-functions
String functions SQL practice

Practice →


4. SUBSTRING and slicing — SUBSTR, LEFT, RIGHT, SPLIT_PART

Diagram of SUBSTRING indexing — a string with 1-based position markers showing START and LENGTH parameters, plus LEFT/RIGHT/SPLIT_PART callouts, on a light PipeCode card.

substring sql is 1-based everywhere — say that aloud in interviews

The slicing family extracts a portion of a string. Every major SQL dialect uses 1-based start positions (unlike Python's 0-based indexing). Negative start positions work in some dialects (Postgres SUBSTRING(s FROM -3) counts from the end). Off-by-one errors are the #1 bug in live coding — state "start is 1-based" before you write the query.

ANSI SUBSTRING(string FROM start [FOR length]).

  • Postgres / OracleSUBSTRING('abcdef' FROM 2 FOR 3)'bcd'.
  • MySQLSUBSTRING('abcdef', 2, 3) or SUBSTR('abcdef', 2, 3) (function-call form).
  • SQL ServerSUBSTRING('abcdef', 2, 3) (same as MySQL).
  • Omit length — returns from start to end of string.

LEFT(s, n) and RIGHT(s, n) — convenience wrappers.

  • LEFT('abcdef', 3)'abc'; RIGHT('abcdef', 3)'def'.
  • Supported by — MySQL, SQL Server, Postgres (via compatible functions), Snowflake.
  • Use when — you need a fixed prefix or suffix, not a middle slice.

SPLIT_PART(s, delimiter, field_number) — Postgres / Snowflake.

  • SPLIT_PART('a,b,c', ',', 2)'b' (1-based field index).
  • MySQL equivalentSUBSTRING_INDEX(SUBSTRING_INDEX(s, ',', 2), ',', -1).
  • SQL Server equivalentSTRING_SPLIT (2016+) returns a table; use ROW_NUMBER() to pick field n.

POSITION(substring IN string) / CHARINDEX / INSTR — find then slice.

  • PatternSUBSTRING(s FROM POSITION('@' IN s) + 1) extracts everything after @.
  • PostgresPOSITION('@' IN email); SQL ServerCHARINDEX('@', email); MySQLLOCATE('@', email).
  • Safer with regex — when delimiter appears multiple times, use SPLIT_PART or REGEXP_REPLACE instead.

Negative indexing (Postgres only).

  • SUBSTRING('abcdef' FROM -2)'ef' (last two characters).
  • Portable alternativeRIGHT(s, 2).

Common slicing pitfalls.

  • 0-based habit from PythonSUBSTRING(s, 0, 3) in SQL often returns unexpected results or empty string.
  • Unicode code points vs bytesSUBSTRING operates on characters in Postgres; byte slicing needs SUBSTRING(bytes FROM ... FOR ...).
  • Empty string vs NULLSUBSTRING(NULL, 1, 3) → NULL; SUBSTRING('', 1, 3)''.

SQL
Topic — string-parsing
String parsing drills

Practice →

SQL
Topic — string-manipulation
String manipulation SQL practice

Practice →


5. REPLACE, TRANSLATE, and REGEXP_REPLACE — normalization patterns

sql replace handles literals; reach for regex when the pattern is variable

The replacement family swaps one substring for another. sql replace with the literal REPLACE() function handles fixed strings (REPLACE(phone, '-', '')). TRANSLATE() maps character-by-character (great for stripping punctuation). REGEXP_REPLACE() handles pattern-based swaps (normalize whitespace, mask digits).

REPLACE(string, old, new) — literal swap, all occurrences.

  • SignatureREPLACE(s, 'old', 'new'); replaces every occurrence of old.
  • Case-sensitive'ABC' won't match 'abc'; wrap with LOWER() if case-insensitive replacement is needed.
  • Remove a substringREPLACE(s, '-', '') deletes every hyphen.
  • Supported everywhere — Postgres, MySQL, SQL Server, Oracle, Snowflake, BigQuery.

TRANSLATE(string, from, to) — character-by-character map.

  • Postgres / OracleTRANSLATE('a1b2', '0123456789', '') strips all digits.
  • SQL Server — no native TRANSLATE; emulate with nested REPLACE or REGEXP_REPLACE.
  • Use when — removing a set of characters (punctuation, digits) rather than a single literal.

REGEXP_REPLACE(string, pattern, replacement [, flags]) — pattern swap.

  • PostgresREGEXP_REPLACE(s, '\s+', ' ', 'g') collapses multiple spaces; flag 'g' replaces all matches.
  • MySQLREGEXP_REPLACE(s, '[0-9]', 'X') (MySQL 8.0+).
  • SQL Server — no built-in until recent versions; use PATINDEX loops or application-layer regex.
  • Use when — the pattern is variable (any digit, any whitespace run, email domain swap with regex).

REPLACE vs REGEXP_REPLACE — decision tree.

  • Known literalREPLACE(email, '@old.com', '@new.com').
  • Character classREGEXP_REPLACE(phone, '[^0-9]', '', 'g') keeps digits only.
  • Performance — literal REPLACE is faster; regex has compile overhead per row.

Chained normalization pattern (production pipelines).

SELECT REGEXP_REPLACE(
         TRIM(
           REPLACE(
             REPLACE(raw_name, CHR(9), ' '),  -- tabs to spaces
             '  ', ' '                          -- double spaces (first pass)
           )
         ),
         '\s+', ' ', 'g'                        -- collapse remaining whitespace
       ) AS clean_name
FROM staging.users;
Enter fullscreen mode Exit fullscreen mode

SQL
Topic — string-functions
String functions SQL drills

Practice →

SQL
Topic — string-parsing
String parsing patterns

Practice →


6. TRIM, LTRIM, RTRIM — cleaning dirty text columns

Diagram of TRIM variants — three strings showing LEADING trim, TRAILING trim, and BOTH trim with custom character sets, on a light PipeCode card.

sql trim is the first function in every cleaning pipeline — know all three directions

Dirty text is the default in raw ingestion tables. sql trim removes leading and trailing characters — usually spaces, but you can specify any character set. Always TRIM before CONCAT, REPLACE, or regex so you don't propagate invisible padding.

ANSI TRIM([LEADING | TRAILING | BOTH] [chars] FROM string).

  • TRIM(BOTH ' ' FROM s) — strips spaces from both ends (default when you write TRIM(s) in Postgres).
  • TRIM(LEADING '0' FROM phone) — strips leading zeros only.
  • TRIM(TRAILING '.' FROM url) — strips trailing dots only.
  • Custom character setTRIM(BOTH ' #' FROM tag) removes spaces and # from both ends.

LTRIM(s) / RTRIM(s) — shorthand.

  • MySQL / SQL ServerLTRIM(RTRIM(s)) is the classic "trim both ends" idiom when TRIM(BOTH ...) isn't available.
  • PostgresBTRIM(s) is alias for TRIM(BOTH ' ' FROM s); LTRIM(s, chars) and RTRIM(s, chars) accept custom character sets.

TRIM vs REPLACE vs REGEXP_REPLACE for cleaning.

  • TRIM — only affects leading/trailing characters; safe for indexed columns if applied at ingest.
  • REPLACE(s, ' ', '') — removes all spaces including internal ones; wrong for "John Smith".
  • REGEXP_REPLACE(s, '^\s+|\s+$', '', 'g') — regex trim when custom whitespace (tabs, newlines) must go.

SQL Server TRIM() (2017+).

  • TRIM(s) — strips spaces from both ends; SQL Server 2017 added ANSI-style TRIM.
  • LegacyLTRIM(RTRIM(s)) still appears in older codebases.

SQL interview question — normalize phone numbers to digits-only

Assume leads(lead_id, phone_raw) where phone_raw contains mixed formatting ((555) 123-4567, 555.123.4567, 5551234567). Return lead_id and a phone_digits column containing only the numeric characters, with no leading/trailing whitespace.

Solution Using TRIM + REGEXP_REPLACE

Code (PostgreSQL).

SELECT lead_id,
       REGEXP_REPLACE(TRIM(phone_raw), '[^0-9]', '', 'g') AS phone_digits
FROM leads;
Enter fullscreen mode Exit fullscreen mode

Equivalent in MySQL 8+.

SELECT lead_id,
       REGEXP_REPLACE(TRIM(phone_raw), '[^0-9]', '') AS phone_digits
FROM leads;
Enter fullscreen mode Exit fullscreen mode

Equivalent in SQL Server (nested REPLACE fallback).

SELECT lead_id,
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
         LTRIM(RTRIM(phone_raw)), '(', ''), ')', ''), '-', ''), '.', ''), ' ', ''), '+', '') AS phone_digits
FROM leads;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 TRIM(phone_raw) removes leading/trailing spaces from CSV import
2 REGEXP_REPLACE(..., '[^0-9]', '', 'g') strips every non-digit character globally
3 result column phone_digits uniform format for downstream joins

Output:

lead_id phone_digits
1 5551234567
2 5551234567
3 5551234567

Why this works — concept by concept:

  • TRIM first — whitespace-only differences shouldn't survive into the digit extraction step.
  • Character-class regex[^0-9] matches any non-digit in one pass; cleaner than six nested REPLACE calls.
  • Global flag 'g' — without it, Postgres replaces only the first match; phone strings have multiple punctuation marks.
  • Idempotent output — already-clean rows (5551234567) pass through unchanged.
  • CostO(n) scan; regex per row is acceptable for staging tables; for billion-row fact tables, normalize at ingest with a persisted column.

SQL
Topic — string-manipulation
String cleaning SQL drills

Practice →

SQL
Topic — string-functions
String functions library

Practice →


7. REGEXP, LIKE, and pattern matching — validation vs extraction

Side-by-side diagram of REGEXP vs LIKE — left card shows regex matching an email pattern with capture groups; right card shows LIKE with % wildcards, on a light PipeCode card.

regexp in sql for structure; LIKE for simple prefix/suffix — pick the right tool

Pattern matching splits into two tiers. LIKE (and Postgres ILIKE) handles simple wildcard filters ('a%@%.com'). regexp in sql handles structural validation (email format, phone pattern, extract capture groups). Reviewers test whether you know when regex is overkill.

LIKE and ILIKE — wildcard matching.

  • LIKE 'abc%' — starts with abc.
  • LIKE '%abc' — ends with abc.
  • LIKE '%abc%' — contains abc.
  • LIKE 'a_c'_ matches exactly one character.
  • ILIKE '%ABC%' (Postgres) — case-insensitive contains.
  • Cannot validate structure'a%@%.com' matches 'a@@@@.com'; fine for coarse filters, wrong for validation.

Postgres POSIX regex operators.

  • s ~ 'pattern' — case-sensitive match.
  • s ~* 'pattern' — case-insensitive match.
  • s !~ 'pattern' — does not match.
  • REGEXP_MATCH(s, pattern) — returns text array of capture groups (or NULL).
  • REGEXP_REPLACE(s, pattern, replacement, 'g') — replace all matches.

MySQL REGEXP / REGEXP_LIKE (8.0+).

  • col REGEXP '^[0-9]+$' — MySQL regex syntax (not identical to Postgres POSIX).
  • REGEXP_LIKE(col, pattern) — ANSI-style function form.
  • REGEXP_SUBSTR(col, pattern) — extract matching substring (MySQL 8.0+).

SQL Server — LIKE, PATINDEX, limited regex.

  • LIKE '[0-9]%-%' — T-SQL character classes in LIKE patterns.
  • PATINDEX('%[0-9]%', col) — returns start position of first match (0 if none).
  • No native REGEXP_MATCH — use CLR, application layer, or Azure SQL external functions.

Validation vs filtering — interview decision table.

Goal Tool Example
Filter rows containing @ LIKE '%@%' coarse email filter
Validate email structure ~ '^[^@]+@[^@]+\.[^@]+$' regex in WHERE
Extract domain from email REGEXP_MATCH or SUBSTRING ... FROM '@(.+)$' extraction
Case-insensitive search ILIKE '%error%' (Postgres) log search
Flag invalid phone rows NOT (phone ~ '^[0-9]{10}$') data quality check

Common regex pitfalls.

  • Over-escaping — Postgres string literals need '\\d' for \d; E-string '\\d' or '[0-9]' is clearer in interviews.
  • Catastrophic backtracking — avoid nested quantifiers on large text columns.
  • Index unfriendly — regex in WHERE forces full scan; persist a validation flag column if queried repeatedly.

SQL
Topic — string-functions
REGEXP / LIKE SQL drills

Practice →

SQL
Topic — string-parsing
String parsing SQL practice

Practice →


8. Common string interview patterns — parsing, aggregation, validation

Eight patterns cover 80% of string-related interview questions

The remaining 20% are dialect-specific edge cases. Memorise these eight and you'll handle most prompts on the first try.

Pattern 1 — build a full name from parts.

  • TRIM(CONCAT(first_name, ' ', COALESCE(middle_name || ' ', ''), last_name)) (Postgres with ||).
  • Or CONCAT_WS(' ', first_name, middle_name, last_name) (skips NULLs in MySQL/SQL Server style).

Pattern 2 — extract domain from email.

  • Postgres(REGEXP_MATCH(email, '@(.+)$'))[1] or SPLIT_PART(email, '@', 2).
  • MySQLSUBSTRING_INDEX(email, '@', -1).
  • SQL ServerRIGHT(email, LEN(email) - CHARINDEX('@', email)).

Pattern 3 — split delimited string into rows.

  • Postgresunnest(string_to_array(tags, ',')).
  • MySQL — recursive CTE or JSON_TABLE (8.0+).
  • SQL ServerSTRING_SPLIT(tags, ',') (2016+).

Pattern 4 — aggregate tags back into one string.

  • PostgresSTRING_AGG(tag, ', ' ORDER BY tag).
  • MySQLGROUP_CONCAT(tag ORDER BY tag SEPARATOR ', ').
  • SQL ServerSTRING_AGG(tag, ', ') WITHIN GROUP (ORDER BY tag).

Pattern 5 — pad to fixed width.

  • PostgresLPAD(id::text, 6, '0')'000042'.
  • MySQLLPAD(id, 6, '0').
  • SQL ServerRIGHT('000000' + CAST(id AS VARCHAR), 6).

Pattern 6 — case normalization for joins.

  • LOWER(TRIM(email)) on both sides of the join — or persist a email_normalized column at ingest.

Pattern 7 — find rows failing a validation rule.

SELECT *
FROM users
WHERE email !~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
Enter fullscreen mode Exit fullscreen mode

Pattern 8 — mask sensitive data in a report.

  • PostgresREGEXP_REPLACE(ssn, '(\d{3})-(\d{2})-(\d{4})', '***-**-\3').
  • Shows last four digits only; common compliance interview prompt.

SQL
Topic — string-parsing
String parsing drills

Practice →

SQL
Topic — string-concatenation
String concatenation patterns

Practice →


Choosing the right string function (cheat sheet)

A one-screen cheat sheet for using SQL string functions — pick the family that matches your intent, then the dialect-specific function.

You want to … Function family Postgres / Standard MySQL SQL Server
Join strings (NULL-safe) Concatenation CONCAT(a, b, c) CONCAT(a, b, c) CONCAT(a, b, c)
Join with separator (skip NULLs) Concatenation CONCAT_WS(', ', a, b) CONCAT_WS(', ', a, b) CONCAT_WS(', ', a, b)
Join strings (NULL-poisoning) Concatenation `a \ \ b`
Extract middle of string Slicing SUBSTRING(s FROM 2 FOR 3) SUBSTRING(s, 2, 3) SUBSTRING(s, 2, 3)
First / last N characters Slicing LEFT(s, n) / RIGHT(s, n) LEFT(s, n) / RIGHT(s, n) LEFT(s, n) / RIGHT(s, n)
Split on delimiter (field N) Slicing SPLIT_PART(s, ',', 2) SUBSTRING_INDEX(...) STRING_SPLIT + row number
Swap literal substring Replacement REPLACE(s, 'old', 'new') REPLACE(s, 'old', 'new') REPLACE(s, 'old', 'new')
Strip character set Replacement TRANSLATE(s, '012', '') nested REPLACE nested REPLACE
Pattern-based swap Replacement REGEXP_REPLACE(s, pat, rep, 'g') REGEXP_REPLACE(s, pat, rep) nested REPLACE / CLR
Strip leading/trailing spaces Cleaning TRIM(s) / BTRIM(s) TRIM(s) TRIM(s) (2017+)
Strip one side only Cleaning LTRIM(s) / RTRIM(s) LTRIM(s) / RTRIM(s) LTRIM(s) / RTRIM(s)
Simple wildcard filter Pattern LIKE 'abc%' LIKE 'abc%' LIKE 'abc%'
Case-insensitive filter Pattern ILIKE '%abc%' LOWER(s) LIKE '%abc%' LOWER(s) LIKE '%abc%'
Regex match / validate Pattern s ~ '^pattern$' REGEXP_LIKE(s, pat) PATINDEX('%pat%', s) > 0
Extract capture group Pattern (REGEXP_MATCH(s, pat))[1] REGEXP_SUBSTR(s, pat) application layer
Aggregate strings Aggregation STRING_AGG(col, ', ') GROUP_CONCAT(col) STRING_AGG(col, ', ')

Frequently asked questions

What's the difference between CONCAT and || in PostgreSQL?

CONCAT(a, b, c) is variadic and NULL-safe — it treats NULL arguments as empty strings, so CONCAT('a', NULL, 'b') returns 'ab'. The || operator follows standard SQL NULL propagation — if any operand is NULL, the entire result is NULL. Use || when you want strict NULL semantics and will wrap nullable columns with COALESCE. Use CONCAT() or CONCAT_WS() when building display strings from nullable name parts. MySQL's CONCAT() returns NULL if any argument is NULL (different from Postgres); use CONCAT_WS() there to skip NULLs. SQL Server's CONCAT() (2012+) matches Postgres null-safe behavior.

Is SUBSTRING 0-based or 1-based in SQL?

1-based in every major dialect — SUBSTRING('abcdef', 1, 3) returns 'abc', and position 1 is the first character. This differs from Python, JavaScript, and most programming languages where indexing starts at 0. Postgres also supports negative start positions (SUBSTRING(s FROM -3) returns the last three characters). When extracting after a delimiter, compute POSITION('@' IN email) + 1 as the start — the + 1 skips the delimiter itself. Saying "start is 1-based" aloud in an interview prevents the most common off-by-one bug.

When should I use REPLACE vs REGEXP_REPLACE?

Use REPLACE(s, old, new) when swapping a known literal substring — domain rewrites, removing hyphens, fixing a consistent typo. It's faster and easier to read. Use REGEXP_REPLACE(s, pattern, replacement, 'g') when the pattern is variable — strip all non-digits ('[^0-9]'), collapse whitespace ('\s+'), mask SSN digits. In Postgres, remember the 'g' flag for global replacement; without it, only the first match is replaced. For character-set removal (all punctuation), Postgres TRANSLATE() is often cleaner than regex.

How do I trim custom characters, not just spaces?

ANSI syntax: TRIM(LEADING '0' FROM phone) strips leading zeros; TRIM(BOTH ' #' FROM tag) strips spaces and # from both ends. Postgres adds LTRIM(s, chars) and RTRIM(s, chars) with a custom character set argument. MySQL's TRIM() only removes spaces — for custom characters use TRIM(LEADING '0' FROM col) (MySQL 8.0 supports the full ANSI form) or nested REPLACE. SQL Server 2017+ supports TRIM() for spaces; for custom chars use TRIM(' #' FROM col) in compatible modes or LTRIM(RTRIM(REPLACE(s, '#', ''))) as a fallback.

What's the difference between REGEXP and LIKE in SQL?

LIKE uses % (any-length wildcard) and _ (single character) for simple prefix/suffix/contains filters — 'a%@%.com' matches strings starting with a and containing @ then .com, but cannot enforce full email structure. regexp in sql (Postgres ~, MySQL REGEXP_LIKE) matches against a full regular expression — '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' validates email shape. Use LIKE for coarse filtering on indexed columns where possible; use regex for validation flags, extraction (REGEXP_MATCH), and complex replacement. Postgres ILIKE adds case-insensitive LIKE; other dialects emulate with LOWER(col) LIKE LOWER(pattern).


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to string functions (CONCAT, SUBSTRING, REPLACE, TRIM, REGEXP), string manipulation for cleaning and normalization, string parsing for delimiter splits and extraction, and the dialect quirks that fail candidates who memorise only one engine. Whether you're drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same five-family taxonomy this guide teaches.

Kick off via Explore practice →; drill the dedicated string-functions SQL lane →; fan out into the string-manipulation SQL lane →; rehearse string-parsing drills →; sharpen string-concatenation patterns →; widen coverage on the full SQL practice library →.

Top comments (0)