DEV Community

Cover image for SQL LIKE, REGEXP & Wildcard Pattern Matching for Data Engineers
Gowtham Potureddi
Gowtham Potureddi

Posted on

SQL LIKE, REGEXP & Wildcard Pattern Matching for Data Engineers

Every production data pipeline eventually grows a layer of sql query with wildcard logic — email validation, log line filtering, name normalisation, URL bucket extraction, fuzzy joins between vendor feeds. The shape is always the same: a string column, a pattern, and a WHERE clause that decides which rows survive. Interviewers love this surface because the right answer changes with dataset size, dialect, and where the leading wildcard sits — three knobs that separate a junior **like in sql** answer from a senior sql pattern matching plan that holds at scale.

This guide walks the three pattern-matching layers you will reach for in real data engineering work: **sql like operator** with % and _ for plain prefix and suffix matches, ILIKE plus collation tricks for case-insensitive search across five dialects, and **sql regexp** with capture groups for structured extraction. We then close with the operational story interviewers actually care about — why LIKE '%abc%' becomes the #1 prod offender at 10 million rows, how Postgres' pg_trgm + GIN index turns it back into a bitmap lookup, and where MySQL's FULLTEXT, Snowflake's search-optimization service, and an external Elasticsearch / OpenSearch tier each take over. Every section ships with a worked teaching example and a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why the chosen pattern beats the naive one. By the end you'll have a reusable mental model for sql wildcards that anchors patterns to the start whenever possible, escapes literal % cleanly, and reaches for trigram indexes before the team meeting where someone proposes pulling Elasticsearch into the stack.

PipeCode blog header for a SQL LIKE + REGEXP pattern-matching deep dive — bold white headline 'SQL LIKE · REGEXP · Wildcards' with subtitle 'Pattern matching · ILIKE · trigram indexes' and a stylised %_ wildcard card on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the pattern-matching practice library →, pair it with string function problems →, and pressure-test it with filtering drills →.


On this page


1. Why pattern matching shows up in every prod query

sql query with wildcard lives at the intersection of data cleaning, search, and analytics

The one-sentence invariant: every dataset eventually grows a string column where the only way to filter, validate, or extract a slice is to match against a pattern — and the pattern's shape determines whether your query runs in milliseconds or scans the whole table. Once you internalise the three layers — LIKE for plain % / _ wildcards, ILIKE for case-insensitive convenience, REGEXP for structured capture — the rest of the sql wildcards discussion is a sequence of trade-offs: simplicity vs power, anchored vs unanchored, indexed vs full scan.

Where pattern matching shows up in real DE work.

  • Email validation. WHERE email LIKE '%@%.%' is the quick-and-dirty gate; the regex ^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$ is the rigorous one.
  • URL bucketing. WHERE url LIKE '/checkout/%' to bucket pages into funnels, or a regex to peel out the campaign id from query strings.
  • Log line filtering. WHERE message LIKE '%ERROR%' for a quick triage; trigram-backed WHERE message ILIKE '%timeout%' once the table grows.
  • Name normalisation. WHERE last_name ILIKE 'mc%' to catch McKinsey, McKinney, etc., regardless of capitalisation.
  • Fuzzy joins. Vendor A writes "Apple Inc.", vendor B writes "Apple, Inc" — a normalised LIKE or a trigram similarity score reconciles them.
  • Compliance scans. WHERE col REGEXP '^[0-9]{3}-[0-9]{2}-[0-9]{4}$' to detect SSN-shaped strings in a column that should not contain PII.

The three layers in one sentence each.

  • LIKE and ILIKE — string operators with two wildcards (% for zero-or-more, _ for exactly-one) and an optional escape clause. Cheap to write, supported everywhere, fast when the pattern is anchored at the start.
  • REGEXP / ~ / REGEXP_LIKE — full regular-expression matching with character classes, quantifiers, alternation, and capture groups. Vastly more expressive; always scans unless paired with a specialised index.
  • Search / FULLTEXT / trigram indexespg_trgm + GIN on Postgres, FULLTEXT MATCH AGAINST on MySQL, search optimisation service on Snowflake, the SEARCH function on BigQuery, and external Elasticsearch / OpenSearch for the truly large case.

Why "naive LIKE" is the #1 prod performance offender.

  • A pattern that starts with % (e.g. '%timeout%') cannot use a B-tree index on the column — the database has to read every row and run the matcher.
  • A pattern that starts with a literal prefix (e.g. '/checkout/%') is a range scan: the database walks the index from '/checkout/' to '/checkout0' (the next byte) and only reads those rows.
  • The 100-row dev table hides this completely; the 10-million-row prod table makes it the slowest query in the warehouse.
  • The fix is one of three things: (a) anchor the pattern to the start, (b) install a trigram or FULLTEXT index, or (c) move text search to a dedicated engine.

What interviewers listen for.

  • Do you immediately distinguish anchored ('abc%') from unanchored ('%abc%') patterns when performance comes up? — senior signal.
  • Do you reach for ILIKE on Postgres / Snowflake and COLLATE on SQL Server when case-insensitivity is asked? — dialect literacy.
  • Do you mention pg_trgm + GIN when the interviewer asks "how do you make LIKE '%foo%' fast?" — production-grade answer.
  • Do you reach for REGEXP capture groups when the question is "extract the domain from the email"? — required answer.

Worked example — pick the right layer for five real queries

Detailed explanation. Most pattern-matching choices are obvious once you put the query into one of three buckets — exact / prefix / suffix (LIKE), case-insensitive substring (ILIKE or LOWER + LIKE), structured extraction or validation (REGEXP). Picking the wrong layer is the most common interview slip: people reach for regex when LIKE would do, or for LIKE when they need a regex.

Question. For each query below, pick the correct layer (LIKE / ILIKE / REGEXP) and justify it in one sentence:

  1. Find all rows where url starts with /checkout/.
  2. Find all rows where name contains smith regardless of capitalisation.
  3. Find all rows where email matches a strict RFC-ish format.
  4. Find all rows where code is exactly three uppercase letters.
  5. Extract the domain portion of email into a new column.

Input (sample rows).

id url name email code
1 /checkout/start Smith, Alice alice@acme.io ABC
2 /cart smith Bob bob@@acme.io A1C
3 /checkout/pay Smithers not-an-email abc
4 /home Doe carol@beta.co XYZ
5 /checkout/done SMITH, Dan dan@gamma.net ABCD

Code.

-- 1. starts-with → LIKE (anchored)
SELECT id FROM events WHERE url LIKE '/checkout/%';

-- 2. case-insensitive contains → ILIKE (Postgres / Snowflake)
SELECT id FROM events WHERE name ILIKE '%smith%';

-- 3. strict format → REGEXP (Postgres ~ operator)
SELECT id FROM events
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';

-- 4. exactly 3 uppercase letters → REGEXP (LIKE cannot express character class)
SELECT id FROM events WHERE code ~ '^[A-Z]{3}$';

-- 5. extract domain → REGEXP capture group
SELECT id, (regexp_match(email, '^[^@]+@(.+)$'))[1] AS domain FROM events;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. url LIKE '/checkout/%' — anchored prefix. The database can use a B-tree index on url and do a range scan from /checkout/ to /checkout0. Rows 1, 3, 5 match.
  2. name ILIKE '%smith%' — leading wildcard kills the B-tree on name. The pattern is case-insensitive thanks to ILIKE. Rows 1, 2, 3, 5 match (Smith, smith, Smithers, SMITH).
  3. email ~ '^...$' — only LIKE can't express "one-or-more of a character class." Rows 1, 4, 5 match; row 2 fails on @@; row 3 has no @.
  4. code ~ '^[A-Z]{3}$' — LIKE could express "three of any character" with '___', but not "three uppercase letters." Row 1 (ABC) matches; row 2 has a digit; row 3 is lowercase; row 5 has 4 chars.
  5. regexp_match(email, '^[^@]+@(.+)$')[1] — the capture group (.+) after the @ extracts the domain into a new column. acme.io, beta.co, gamma.net for the valid rows.

Output.

Query Layer Matches Why
1 — url starts with /checkout/ LIKE 1, 3, 5 anchored prefix, index range scan
2 — name contains "smith" ILIKE 1, 2, 3, 5 case-insensitive convenience
3 — strict email format REGEXP 1, 4, 5 needs character class + quantifier
4 — exactly 3 uppercase REGEXP 1 LIKE can't constrain to uppercase
5 — extract domain REGEXP capture 1, 4, 5 needs (.+) capture group

Rule of thumb. Reach for LIKE first; bump to ILIKE when you need case-insensitivity on Postgres / Snowflake; only reach for REGEXP when you need character classes, quantifiers, alternation, or capture groups. Don't pay the regex tax for a query that LIKE would handle in five fewer characters.

Worked example — anatomy of a failed shortcut from LIKE to REGEXP

Detailed explanation. A common code-review smell is a regex where a LIKE would have done. The cost of the wrong choice is twofold: harder-to-read SQL and a planner that can no longer use a B-tree index. Walking through the rewrite both ways teaches the trade-off concretely.

Question. A junior engineer wrote WHERE order_id ~ '^ORD-[0-9]+$' to find every row whose order_id follows the format ORD-{digits}. Rewrite this two ways: (a) using LIKE plus an explicit length / digit check, and (b) using a stricter regex that pins the digit count. Compare the index usage of all three forms.

Input (sample rows).

id order_id
1 ORD-1001
2 ORD-2002
3 ord-3003
4 ORDER-4004
5 ORD-A2B
6 ORD-

Code.

-- Original (regex, full scan)
SELECT id FROM orders WHERE order_id ~ '^ORD-[0-9]+$';

-- Rewrite (a) — LIKE with anchored prefix; relies on a separate check for "all digits after prefix"
SELECT id FROM orders
WHERE order_id LIKE 'ORD-%'
  AND order_id NOT LIKE 'ORD-%[^0-9]%';   -- MySQL / SQL Server character class

-- Rewrite (b) — regex pinned to exactly N digits (4-8 in this catalogue)
SELECT id FROM orders WHERE order_id ~ '^ORD-[0-9]{4,8}$';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Original regex — anchored at both ends, but the planner still does a full scan unless pg_trgm is installed. The anchor ^ORD- could be a B-tree range scan, but the regex predicate disables that path.
  2. Rewrite (a) — splits the predicate into an anchored LIKE 'ORD-%' (B-tree-friendly) plus a content check NOT LIKE 'ORD-%[^0-9]%'. The planner can use the B-tree on order_id to narrow to candidates, then apply the second clause as a filter.
  3. Rewrite (b) — same regex shape but with {4,8} instead of +. This rejects ORD-A2B and ORD- cleanly and adds a useful length sanity check.
  4. Row 5 (ORD-A2B) — fails original regex (letters), passes rewrite (a) only after the NOT LIKE check, fails rewrite (b).
  5. Row 6 (ORD-) — has the prefix but no digits. Original regex rejects (the + requires at least one digit). Rewrite (a) rejects (the NOT LIKE check would not match — needs careful framing — depends on dialect). Rewrite (b) rejects ({4,8} requires at least 4 digits).

Output.

Form Index used Matches Notes
~ '^ORD-[0-9]+$' none (seq scan) 1, 2 correct but slow on large tables
LIKE 'ORD-%' AND NOT LIKE 'ORD-%[^0-9]%' B-tree range scan + filter 1, 2 fast on Postgres; dialect-sensitive bracket syntax
~ '^ORD-[0-9]{4,8}$' + pg_trgm GIN on ORD- literal 1, 2 best when trigram index is already present

Rule of thumb. When a regex starts with a literal prefix (^ORD-), rewrite it to a LIKE 'ORD-%' predicate plus a secondary filter for the content. The B-tree range scan on the prefix is almost always the right plan on a transactional database; the secondary filter rarely costs more than the regex would have.

SQL interview question on picking the right pattern-matching layer

A common opener: "I have a WHERE email LIKE '%@gmail.com' query that runs nightly. It's slow and I can't add an index because of the leading %. How would you reason about it?" The interviewer is testing whether you can name the leading-wildcard problem, separate the correctness layer (LIKE handles it) from the performance layer (leading wildcard kills the index), and propose a fix without over-engineering.

Solution Using reverse-string trick + indexed range scan

-- Original (slow)
SELECT id FROM users WHERE email LIKE '%@gmail.com';

-- Fix: store reverse(email) and use an anchored prefix on the reversed value
ALTER TABLE users ADD COLUMN email_rev TEXT
  GENERATED ALWAYS AS (reverse(email)) STORED;
CREATE INDEX idx_users_email_rev ON users (email_rev);

-- New query — anchored prefix → index range scan
SELECT id FROM users WHERE email_rev LIKE reverse('@gmail.com') || '%';
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
1 Add generated column email_rev = reverse(email) every row gets a reversed copy of the email, maintained automatically
2 Create B-tree index on email_rev lookups on email_rev use the index
3 Rewrite query as `email_rev LIKE reverse('@gmail.com') \ \
4 Planner picks index range scan from {% raw %}moc.liamg@ to moc.liamg{ reads only matching rows

Output:

Metric Before (LIKE '%@gmail.com') After (reverse trick)
Plan sequential scan index range scan
Rows read 10,000,000 ≈ 1,200
Cost 220,000 18
Wall time (p50) 8.4 s 6 ms

Why this works — concept by concept:

  • Leading-wildcard problemLIKE '%suffix' can never use a B-tree on the original column because B-tree indexes are sorted by prefix. The planner has no choice but to read every row.
  • Reverse-string trick — by indexing reverse(email), the suffix '@gmail.com' becomes the prefix 'moc.liamg@' of the reversed string, which the B-tree can range-scan.
  • Generated column — Postgres maintains email_rev automatically on every insert / update; no application code touches it.
  • Alternative: trigram index — for general substring matching (%foo%, not just %suffix), the right answer is pg_trgm + GIN instead of the reverse trick. The reverse trick beats trigrams only when the pattern is strictly a suffix.
  • Cost — disk = O(N) extra (one reversed copy); index = O(N log N) build; query = O(log N + matches) range scan.

SQL
Topic — pattern matching
SQL pattern-matching problems (LIKE, REGEXP, wildcards)

Practice →


2. LIKE basics — % and _ wildcards plus escape characters

like in sql is two wildcards, one optional escape, and a single performance rule — anchor or pay

The mental model in one line: LIKE is the SQL comparison operator that matches a string against a pattern containing exactly two wildcards — % for zero-or-more characters and _ for exactly-one character — plus an optional ESCAPE clause for matching the wildcards literally. Once you say "anchor the pattern to the start whenever possible," half of the sql like operator interview surface collapses to obvious answers.

Visual diagram of LIKE wildcards — left a cheat card for % (zero or more chars) and _ (exactly one char); right a 3-row example showing 'abc%' (starts-with, fast), '%abc' (ends-with, slow), '%abc%' (contains, slow); a small escape-character chip; on a light PipeCode card.

The two wildcards.

  • % — matches zero or more characters of any kind. 'abc%' matches 'abc', 'abcd', 'abc123'; '%abc' matches anything ending in 'abc'; '%abc%' matches anything containing 'abc'.
  • _ — matches exactly one character. 'a_c' matches 'abc', 'axc', 'a1c'; but not 'ac' (zero chars) or 'abbc' (two). Combined: 'a__c' matches 'abbc', 'a12c'; 'abc_' matches any four-character string starting with 'abc'.

The three shapes and their performance.

  • col LIKE 'prefix%' — anchored prefix. Uses a B-tree index range scan. The fastest shape; this is the one you want.
  • col LIKE '%suffix' — anchored suffix. Full scan on a B-tree-indexed column. Fix: store reverse(col) and index it.
  • col LIKE '%substring%' — unanchored contains. Full scan. Fix: pg_trgm + GIN (Postgres), FULLTEXT (MySQL), search optimisation (Snowflake), SEARCH function (BigQuery), or external Elasticsearch.

Escape characters.

  • The wildcards themselves are not literal characters. To match a literal % or _ you must escape it.
  • Postgres / MySQL / SQL Server / Snowflake / BigQuery all support the standard ESCAPE clause: WHERE col LIKE '50\%' ESCAPE '\' matches the literal four-character string '50%'.
  • You can pick any single character as the escape: LIKE '50!%' ESCAPE '!' is equally valid and useful when \ already appears in the data.

Case sensitivity by default.

  • Postgres, Snowflake, BigQueryLIKE is case-sensitive. Use ILIKE (Postgres / Snowflake) or LOWER(col) LIKE LOWER(pattern) (portable).
  • MySQLLIKE is case-insensitive by default because of the default utf8mb4_general_ci collation. Switch to a _cs (case-sensitive) collation if you need case-sensitive matching.
  • SQL Server — depends on the column's collation. Latin1_General_CI_AS is case-insensitive; Latin1_General_CS_AS is case-sensitive. You can override per-query with COLLATE.

Worked example — match phone numbers with a mix of _ and %

Detailed explanation. Phone numbers come in two shapes: a strict ten-digit format (555-123-4567) and a free-form one with optional country code (+1 555 123 4567). Picking the right wildcard mix is a classic like in sql interview probe — it tests whether you remember that _ is one char and % is zero-or-more.

Question. Write a LIKE pattern that matches the format XXX-XXX-XXXX exactly (no country code, no parentheses), where each X is any character. Then write a second pattern that matches the same format optionally preceded by +1 (a plus sign, the digit one, and a space).

Input (sample rows).

id phone
1 555-123-4567
2 +1 555-123-4567
3 555 123 4567
4 (555) 123-4567
5 555-1234
6 555-123-4567-ext

Code.

-- Pattern 1: strict XXX-XXX-XXXX
SELECT id, phone FROM contacts
WHERE phone LIKE '___-___-____';

-- Pattern 2: optional "+1 " prefix
SELECT id, phone FROM contacts
WHERE phone LIKE '___-___-____'
   OR phone LIKE '+1 ___-___-____';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pattern 1 uses 12 underscores total___-___-____ is exactly three underscores, a literal hyphen, three underscores, a literal hyphen, four underscores. LIKE matches the whole string, so this pattern only matches strings of length 12 in that exact shape.
  2. Row 1 (555-123-4567) — 12 chars in the right shape → matches.
  3. Row 2 (+1 555-123-4567) — 15 chars → does not match pattern 1. Pattern 2 explicitly includes the +1 prefix and matches.
  4. Row 3 (555 123 4567) — spaces instead of hyphens → does not match either pattern.
  5. Row 5 (555-1234) — only 8 chars → does not match.
  6. Row 6 (555-123-4567-ext) — 16 chars, extra suffix → does not match either pattern (the patterns are anchored at both ends since LIKE matches the whole string).

Output.

id phone Pattern 1 Pattern 2
1 555-123-4567 match match
2 +1 555-123-4567 match
3 555 123 4567
4 (555) 123-4567
5 555-1234
6 555-123-4567-ext

Rule of thumb. _ is the right tool when you need exactly N characters in this shape. Reach for % when you accept any number of characters here — including zero. Combining the two is how LIKE expresses constraints like "10 digits with hyphens" without dropping into regex.

Worked example — escape a literal % and _ in a marketing campaigns table

Detailed explanation. A marketing analytics table has a campaign_name column with values like "Summer Sale 50% Off", "Q3_Launch", and "Black_Friday 25%". The interviewer asks for two queries: one that finds every campaign whose name contains a literal percentage sign, and one that finds every campaign whose name starts with "Q3_" (underscore is the literal third character, not a wildcard).

Question. Write the two queries on Postgres. Then show the equivalent on MySQL where the default escape character is \. Then write a third query that finds every campaign with both a literal % and a literal _ somewhere in the name.

Input (sample rows).

id campaign_name
1 Summer Sale 50% Off
2 Q3_Launch
3 Black_Friday 25%
4 Spring Renewal
5 Year_End
6 Holiday 100% Match

Code.

-- 1. Names containing a literal % — designate `!` as the escape char
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE '%!%%' ESCAPE '!';

-- 2. Names starting with literal "Q3_" — escape the underscore
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE 'Q3!_%' ESCAPE '!';

-- 3. Names containing both a literal % and a literal _
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE '%!%%' ESCAPE '!'
  AND campaign_name LIKE '%!_%' ESCAPE '!';

-- Same on MySQL with default backslash escape (works on Postgres too)
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE '%\%%' ESCAPE '\\';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. LIKE '%!%%' ESCAPE '!' — the ESCAPE '!' clause designates ! as the escape character. The pattern reads as: any number of characters, then a literal % (because !% is the escape sequence), then any number of characters.
  2. LIKE 'Q3!_%' ESCAPE '!'!_ is the escape for a literal underscore. So the pattern reads as Q3 followed by a literal _ followed by any number of characters. Row 2 (Q3_Launch) matches; row 5 (Year_End) doesn't (wrong prefix).
  3. Combined query — the two LIKE clauses are independent; both must hold. Row 3 (Black_Friday 25%) matches both (it has a _ and a %).
  4. MySQL with backslash'%\%%' ESCAPE '\\' is the same pattern. The double backslash in the SQL string is the standard string-literal escape; the resulting four-character pattern is % \ % %, which LIKE reads as %, escape-%, %.
  5. Why pick ! over \ — when the data itself contains backslashes (file paths, JSON-encoded strings), \ collides. ! (or any unused character) makes the intent unambiguous.

Output.

Query Matches
1 — literal % 1, 3, 6
2 — starts with literal Q3_ 2
3 — both literal % and literal _ 3

Rule of thumb. When your data may contain backslashes, pick a non-data character (!, |, #) as the ESCAPE. The ESCAPE clause works the same way on every major dialect — Postgres, MySQL, SQL Server, Snowflake, and BigQuery — so the pattern you write here ports cleanly.

SQL interview question on building an anchored search

A senior probe sounds like: "We log every page view into a url column. Write a query that returns the count of page views per top-level section, where a section is the slug between the first and second slashes — /checkout/, /cart/, /account/, etc. Solve it with LIKE only, no regex." The interviewer is checking whether you can build anchored patterns that compose into a GROUP BY.

Solution Using anchored LIKE patterns with explicit OR

SELECT
  CASE
    WHEN url LIKE '/checkout/%' THEN 'checkout'
    WHEN url LIKE '/cart%'      THEN 'cart'
    WHEN url LIKE '/account/%'  THEN 'account'
    WHEN url LIKE '/search%'    THEN 'search'
    WHEN url LIKE '/blog/%'     THEN 'blog'
    ELSE 'other'
  END AS section,
  COUNT(*) AS pageviews
FROM events
WHERE event_type = 'page_view'
GROUP BY 1
ORDER BY pageviews DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

url which LIKE matches section
/checkout/start '/checkout/%' checkout
/checkout/done '/checkout/%' checkout
/cart '/cart%' (no trailing / required) cart
/cart/items '/cart%' cart
/account/settings '/account/%' account
/search?q=widgets '/search%' (no trailing slash) search
/blog/post-1 '/blog/%' blog
/unknown-page none — fallback other

Output:

section pageviews
checkout 1,820,331
cart 1,402,557
blog 880,210
search 612,884
account 488,712
other 312,401

Why this works — concept by concept:

  • Anchored prefix patterns — every LIKE clause starts with a literal prefix, so the database can use a B-tree index on url and do a range scan per branch. No leading wildcards anywhere.
  • CASE WHEN ordering — the first matching branch wins, which lets you put the more specific patterns earlier (e.g. /cart/items matches /cart% before any fallback).
  • Trailing / vs no /'/cart%' (no slash) catches both /cart (the index page) and /cart/items (a sub-page). '/cart/%' (with slash) would miss /cart itself. Pick the shape that matches the URL convention.
  • Bucket fallback — the ELSE 'other' branch ensures every row is counted somewhere. Without it, GROUP BY would silently drop rows whose URL doesn't match any pattern.
  • Cost — query = O(N) full scan since we touch every row; per-row pattern match = O(L) where L is the URL length. The B-tree index helps if the planner can rewrite the CASE into per-branch range scans (Postgres 12+ does this for trivial CASEs; older versions may need a manual UNION ALL).

SQL
Topic — pattern matching
LIKE pattern-matching problems (SQL)

Practice →


3. ILIKE and case-insensitive matching across dialects

sql like operator becomes a dialect quiz the moment "case-insensitive" is mentioned

The mental model in one line: ILIKE is the Postgres / Snowflake convenience operator that does what LIKE does, but case-insensitively; MySQL gives you case-insensitive LIKE for free via the default collation; SQL Server controls case sensitivity via column collation; BigQuery makes you write a regex with the (?i) flag. Five dialects, five answers, one production decision: use the dialect's native shape whenever possible.

Visual ILIKE dialect matrix — 5 dialect columns (Postgres, MySQL, SQL Server, Snowflake, BigQuery) and one row showing how each implements case-insensitive matching (ILIKE, LIKE-by-collation, COLLATE CI, ILIKE, REGEXP_CONTAINS); a small performance note at the bottom; on a light PipeCode card.

The five-dialect matrix.

  • Postgrescol ILIKE '%abc%' is a native operator that matches case-insensitively. Same semantics as LIKE, just no respect for case. Backed by ~~* under the hood.
  • MySQLcol LIKE '%abc%' is already case-insensitive by default because the default collation (utf8mb4_0900_ai_ci in MySQL 8) is case-insensitive. Switch to LIKE BINARY or a _bin / _cs collation if you need case-sensitive matching.
  • SQL Servercol LIKE '%abc%' defers to the column's collation. Default Latin1_General_CI_AS is case-insensitive. Override per-query: col LIKE '%abc%' COLLATE Latin1_General_CI_AI. There is no ILIKE keyword.
  • Snowflakecol ILIKE '%abc%' is a native operator, same semantics as Postgres. Also supports the LIKE_ANY variants for multi-pattern matching.
  • BigQuery — no ILIKE keyword. Use REGEXP_CONTAINS(col, r'(?i)abc') with the inline (?i) flag, or LOWER(col) LIKE LOWER('%abc%') as a portable fallback.

Portable workaround.

-- Works on every dialect, kills any index on `col`
WHERE LOWER(col) LIKE LOWER('%abc%')
Enter fullscreen mode Exit fullscreen mode

This pattern is universally understood but expensive — applying LOWER() to the column prevents any B-tree index from being used. A functional index on LOWER(col) (Postgres, Oracle, MySQL 8.0.13+) restores the index path; SQL Server uses computed columns.

Performance — ILIKE vs LIKE.

  • On Postgres, ILIKE is typically 10–20 % slower than LIKE because the matcher does extra collation work per byte. The difference is small on short strings but adds up on TEXT columns with multi-byte UTF-8.
  • On Snowflake, the difference is negligible at the storage layer (micro-partitions are scanned columnarly anyway).
  • On MySQL, LIKE with a case-insensitive collation is the fast path; LIKE BINARY (case-sensitive) is slightly faster on ASCII columns but identical on most production workloads.
  • On SQL Server, COLLATE clauses inside WHERE can disable index seeks if the chosen collation differs from the index's collation. Watch this in execution plans.

Locale and Unicode gotchas.

  • Turkish dotless ı vs dotted i, German ß vs ss, the various Greek accents — the right collation matters when international data flows in.
  • Postgres ILIKE uses the database default collation; override with COLLATE "C" (byte-wise) or COLLATE "und-x-icu" for ICU-driven Unicode rules.
  • Snowflake's ILIKE is Unicode-aware out of the box.

Worked example — same case-insensitive contains query across five dialects

Detailed explanation. A common practice question is: "Find every customer whose name contains 'smith' in any capitalisation. Write the query for Postgres, MySQL, SQL Server, Snowflake, and BigQuery." The query is trivial — what's tested is whether you know each dialect's native shape.

Question. Given a customers table with a full_name column, write five queries that return the same result — every row whose full_name contains 'smith' in any case — one per dialect. Then state the expected performance order.

Input (sample rows).

id full_name
1 Alice SMITH
2 bob smithers
3 Carol Smyth
4 Dan O'Smith
5 Eve Doe

Code.

-- Postgres
SELECT id, full_name FROM customers
WHERE full_name ILIKE '%smith%';

-- MySQL (default case-insensitive collation — LIKE is enough)
SELECT id, full_name FROM customers
WHERE full_name LIKE '%smith%';

-- SQL Server (force CI collation in case the column is CS)
SELECT id, full_name FROM customers
WHERE full_name LIKE '%smith%' COLLATE Latin1_General_CI_AS;

-- Snowflake
SELECT id, full_name FROM customers
WHERE full_name ILIKE '%smith%';

-- BigQuery (no ILIKE; use REGEXP_CONTAINS with (?i))
SELECT id, full_name FROM customers
WHERE REGEXP_CONTAINS(full_name, r'(?i)smith');
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Postgres ILIKE is a single token; the planner translates it to the internal ~~* operator. No collation argument needed.
  2. MySQL LIKE picks up case-insensitivity from utf8mb4_0900_ai_ci. If the column is bound to a _cs collation, the same query becomes case-sensitive — you'd need COLLATE utf8mb4_general_ci per-query.
  3. SQL Server infers case from the column collation. If you can't be sure, the COLLATE clause forces it. Be aware: explicit COLLATE in WHERE may disable index seeks.
  4. Snowflake ILIKE is the same shape as Postgres. Snowflake also offers ILIKE_ANY('%smith%', '%jones%') for multi-pattern OR — convenient when the pattern list grows.
  5. BigQuery has no ILIKE. The cleanest substitute is REGEXP_CONTAINS(col, r'(?i)pattern') because the r'(?i)...' raw string with inline flag is idiomatic. A portable fallback is LOWER(col) LIKE LOWER('%smith%').

Output.

Dialect Query Matches
Postgres ILIKE '%smith%' 1, 2, 4
MySQL LIKE '%smith%' (default CI) 1, 2, 4
SQL Server LIKE … COLLATE … CI_AS 1, 2, 4
Snowflake ILIKE '%smith%' 1, 2, 4
BigQuery REGEXP_CONTAINS(col, r'(?i)smith') 1, 2, 4

Rule of thumb. Use the dialect's native shape whenever possible (ILIKE on PG/SF, LIKE on MySQL, COLLATE on SQL Server, REGEXP_CONTAINS on BigQuery). The portable LOWER(col) LIKE LOWER(pattern) workaround works everywhere but pays the cost of a per-row function call and disables ordinary B-tree indexes.

Worked example — LIKE_ANY and ILIKE_ANY for multi-pattern OR on Snowflake

Detailed explanation. A common refactor target is a WHERE col ILIKE '%foo%' OR col ILIKE '%bar%' OR col ILIKE '%baz%' OR … chain that grows alongside a hard-coded list of brand names, error codes, or category keywords. Snowflake ships LIKE_ANY / ILIKE_ANY to compress the chain into one operator with a comma-separated pattern list.

Question. A support_tickets table needs to flag every row whose subject mentions any of refund, chargeback, dispute, or cancellation — case-insensitive. Show the naive chained OR ILIKE form, the Snowflake ILIKE_ANY form, the Postgres ANY (ARRAY[…]) form, and the regex alternation form. Compare them on readability and performance.

Input (sample rows).

id subject
1 Need refund for order #4711
2 Chargeback notice received
3 Unable to log in
4 DISPUTE: missing item
5 Order cancellation request
6 When will my order ship?

Code.

-- 1. Naive chained OR — works everywhere; verbose
SELECT id, subject FROM support_tickets
WHERE subject ILIKE '%refund%'
   OR subject ILIKE '%chargeback%'
   OR subject ILIKE '%dispute%'
   OR subject ILIKE '%cancellation%';

-- 2. Snowflake — ILIKE_ANY
SELECT id, subject FROM support_tickets
WHERE subject ILIKE ANY ('%refund%', '%chargeback%', '%dispute%', '%cancellation%');

-- 3. Postgres — ANY (ARRAY[...]) with ILIKE
SELECT id, subject FROM support_tickets
WHERE subject ILIKE ANY (ARRAY['%refund%', '%chargeback%', '%dispute%', '%cancellation%']);

-- 4. Regex alternation — portable on Postgres / MySQL / Snowflake / BigQuery
SELECT id, subject FROM support_tickets
WHERE subject ~* 'refund|chargeback|dispute|cancellation';   -- Postgres
-- WHERE REGEXP_CONTAINS(subject, r'(?i)refund|chargeback|dispute|cancellation');  -- BigQuery
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Naive OR form — explicit, readable, and supported on every dialect. The planner evaluates each branch separately; if pg_trgm GIN is present on subject, each branch is a bitmap index scan and the planner OR-merges the bitmaps.
  2. Snowflake ILIKE ANY — one operator, one pattern list, one scan. Internally optimised — the engine compiles all patterns into a single matcher.
  3. Postgres ANY (ARRAY[…]) — the SQL-standard shape using array constants. Equivalent in semantics to the naive OR form; whether the planner is smart enough to use a single trigram scan instead of OR-merging is version-dependent (PG 14+ usually does).
  4. Regex alternation'refund|chargeback|dispute|cancellation' is one regex; ~* is the case-insensitive match operator. Compiles to a single state machine — efficient at runtime.
  5. Performance ordering — on a 50 M row Postgres table with pg_trgm, all four forms are within 10 % of each other. On a Snowflake table with search optimisation, ILIKE ANY is typically fastest because it bundles the pattern list. On a dialect without text indexes, all four are full scans of equal cost.

Output.

Form Matches Readability
Naive chained OR ILIKE 1, 2, 4, 5 medium
Snowflake ILIKE ANY 1, 2, 4, 5 high
Postgres ANY (ARRAY[…]) 1, 2, 4, 5 high
Regex alternation 1, 2, 4, 5 medium

Rule of thumb. For more than three patterns, prefer ILIKE ANY (Snowflake) / ANY (ARRAY[…]) (Postgres) / regex alternation (everywhere else). The naive OR chain quickly becomes unreadable; the consolidated forms are easier to refactor when the pattern list comes from a config file or a IN-style sub-query.

SQL interview question on case-insensitive search at scale

A typical scaling-focused probe: "Our customers.full_name column has 50 million rows. The query LOWER(full_name) LIKE LOWER('%smith%') runs in 45 seconds. How do you make it sub-second without changing the SQL contract for callers?" The interviewer wants you to combine an expression index with an ILIKE rewrite (or a trigram index — covered in §5).

Solution Using a functional index on LOWER(full_name) plus an anchored rewrite

-- 1. One-time: create a functional index on the lowercased column
CREATE INDEX idx_customers_name_lower
  ON customers (LOWER(full_name) text_pattern_ops);

-- 2. Rewrite the contains query as an anchored pair where possible.
-- If the contract requires "contains," install pg_trgm + GIN instead (§5).
-- For "starts-with" or "ends-with" you can use the functional index directly:
SELECT id, full_name FROM customers
WHERE LOWER(full_name) LIKE LOWER('smith%');     -- starts-with → index range scan

-- 3. For genuine "contains" semantics, the functional index alone is not enough.
-- The right answer is pg_trgm + GIN (§5):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_customers_name_trgm
  ON customers USING GIN (full_name gin_trgm_ops);

SELECT id, full_name FROM customers
WHERE full_name ILIKE '%smith%';                  -- now uses the GIN index
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Action Effect
1 CREATE INDEX … (LOWER(full_name) text_pattern_ops) Postgres maintains an expression index keyed on the lowercased name
2 Rewrite '…smith…''smith%' where the contract allows starts-with patterns can range-scan the index
3 If contract requires '%smith%' (genuine contains) → install pg_trgm GIN ILIKE '%smith%' now uses the GIN index via trigram lookup
4 Planner picks the lighter index wall-time drops 1000×

Output:

Variant Plan Rows read Wall time (p50)
LOWER(full_name) LIKE LOWER('%smith%') (no index) seq scan 50,000,000 45 s
LOWER(full_name) LIKE 'smith%' + functional B-tree index range scan 1,800 8 ms
full_name ILIKE '%smith%' + pg_trgm GIN bitmap index scan 1,800 14 ms

Why this works — concept by concept:

  • Functional index on LOWER(col) — turns the LOWER(col) LIKE 'smith%' shape into an index range scan because the index key is precomputed.
  • text_pattern_ops opclass — required for LIKE index usage on text columns when the database default collation is not C. Without it, the planner can't prove the range scan is safe.
  • Trigram index for containspg_trgm breaks %smith% into 5 trigrams (smi, mit, ith, ...) and looks each one up in a GIN inverted index. Covered in depth in §5.
  • SQL contract preservation — adding indexes does not change the query result, only the plan. Callers see the same rows, faster.
  • Cost — disk = O(N) extra per index; build = O(N log N); query drops from O(N) full scan to O(log N + matches) for the prefix case, O(matches) for the trigram case.

SQL
Topic — string functions
Case-insensitive search + string function problems

Practice →


4. REGEXP, SIMILAR TO, and regex functions across dialects

sql regexp unlocks character classes, quantifiers, alternation, and capture groups that LIKE cannot express

The mental model in one line: regex in SQL is the same PCRE-flavoured language you know from Python and JavaScript, surfaced through dialect-specific operators (~, ~*) or functions (REGEXP_LIKE, REGEXP_CONTAINS, REGEXP_MATCH, REGEXP_EXTRACT, REGEXP_REPLACE, REGEXP_SUBSTR) — and the price you pay is no index, always a full scan, unless you pair the regex with a separate text-search index. Reach for regex when LIKE can't express the constraint, not before.

Visual diagram of REGEXP capture groups — input string 'alice@example.com' on the left, regex pattern '^([^@]+)@(.+)$' with two capture groups colour-coded, output showing group 1 'alice' and group 2 'example.com'; a 4-dialect function chip strip at the bottom; on a light PipeCode card.

The dialect surface.

  • Postgres — operators ~ (match), ~* (case-insensitive match), !~ and !~* (negated). Functions regexp_match, regexp_matches, regexp_replace, regexp_split_to_array, regexp_split_to_table. Backed by the PCRE-compatible engine baked into the server.
  • MySQLREGEXP and RLIKE (synonyms) for match; REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_REPLACE, REGEXP_INSTR from MySQL 8.0+. Older versions only supported REGEXP (POSIX BRE).
  • SQL Server — no native regex. Workarounds: complex LIKE patterns with character classes (LIKE '[A-Z]%'), CLR functions written in C#, or STRING_SPLIT plus row-level logic.
  • SnowflakeREGEXP, RLIKE, REGEXP_LIKE, REGEXP_SUBSTR, REGEXP_REPLACE, REGEXP_INSTR, REGEXP_COUNT. PCRE-flavoured engine; supports (?i) inline flags.
  • BigQueryREGEXP_CONTAINS, REGEXP_EXTRACT, REGEXP_EXTRACT_ALL, REGEXP_REPLACE, REGEXP_INSTR. RE2 engine — no back-references, no lookahead. Performance is linear; expressiveness is slightly reduced.

The seven regex primitives every SQL interview probes.

  • Anchors^ start, $ end. Critical for "the whole string matches X."
  • Character classes[A-Z], [^@] (not at-sign), \d, \w, \s.
  • Quantifiers? zero-or-one, * zero-or-more, + one-or-more, {n,m} between n and m.
  • Alternation(cat|dog) matches either.
  • Capture groups( … ) captures a slice for extraction.
  • Back-references\1, \2 inside the pattern to match captured text again (not supported on BigQuery's RE2).
  • Inline flags(?i) case-insensitive, (?s) dot-matches-newline, etc.

SIMILAR TO — the SQL-standard half-breed.

  • Postgres also supports SIMILAR TO, defined by the SQL standard. Syntax is somewhere between LIKE and regex: it uses % and _ as wildcards (like LIKE) but also accepts |, *, +, ?, ( … ) (like regex).
  • In practice almost no one uses SIMILAR TO. Either you need regex (use ~) or you don't (use LIKE). Worth recognising on a code review; not worth writing.

Performance shape — regex is always a full scan unless paired with an external index.

  • A B-tree on the column cannot accelerate a regex predicate (except for trivial cases the planner can rewrite into anchored LIKE).
  • Postgres' pg_trgm extension can index regex predicates that contain a substantial literal substring (pg_trgm's text_ops GIN index supports ~, ~*, LIKE, ILIKE).
  • MySQL FULLTEXT and BigQuery SEARCH are not regex-aware — they tokenise and index whole words.

Worked example — extract the domain from an email column

Detailed explanation. A daily reporting query needs the email domain in a new column for an aggregation by provider — gmail.com, yahoo.com, acme.io. The choice is between five layers: SUBSTR + INSTR, SPLIT_PART, REGEXP_EXTRACT, REGEXP_MATCH, or a chain of RIGHT(POSITION()) calls. Regex is the cleanest single-call answer.

Question. Given a users.email column, write a query that returns email, username (before @), and domain (after @) — using regex capture groups on Postgres, MySQL, Snowflake, and BigQuery. Then convert the dialect-specific functions into a single portable shape.

Input (sample rows).

Code.

-- Postgres
SELECT
  email,
  (regexp_match(email, '^([^@]+)@(.+)$'))[1] AS username,
  (regexp_match(email, '^([^@]+)@(.+)$'))[2] AS domain
FROM users;

-- MySQL 8+
SELECT
  email,
  REGEXP_SUBSTR(email, '^[^@]+')           AS username,
  REGEXP_SUBSTR(email, '(?<=@).+$')        AS domain
FROM users;

-- Snowflake
SELECT
  email,
  REGEXP_SUBSTR(email, '^[^@]+', 1, 1)     AS username,
  REGEXP_SUBSTR(email, '@(.+)$', 1, 1, 'e', 1) AS domain
FROM users;

-- BigQuery
SELECT
  email,
  REGEXP_EXTRACT(email, r'^([^@]+)@')      AS username,
  REGEXP_EXTRACT(email, r'@(.+)$')         AS domain
FROM users;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pattern ^([^@]+)@(.+)$ — anchored at both ends. Group 1 is "one or more non-@ characters" (the username). Then a literal @. Group 2 is "one or more of any character" (the domain).
  2. Row 1 (alice@example.com) — Group 1 = alice, Group 2 = example.com.
  3. Row 2 (bob.jones@acme.io) — Group 1 = bob.jones (dots are allowed in [^@]), Group 2 = acme.io.
  4. Row 3 (carol+promo@gmail.com) — Group 1 = carol+promo (+ is a literal inside [^@]), Group 2 = gmail.com.
  5. Row 4 (dan@subdomain.example.co.uk) — Group 1 = dan, Group 2 = subdomain.example.co.uk — multi-level domain captured cleanly.
  6. Row 5 (not-an-email) — no @ → no match → NULL in both username and domain columns.
  7. Why the four dialects look different — Postgres' regexp_match returns a text array; MySQL's REGEXP_SUBSTR returns a string and uses lookbehind for the domain; Snowflake's REGEXP_SUBSTR has a 6-argument form where the 6th argument is the capture-group index; BigQuery's REGEXP_EXTRACT returns the first capture group directly.

Output.

email username domain
alice@example.com alice example.com
bob.jones@acme.io bob.jones acme.io
carol+promo@gmail.com carol+promo gmail.com
dan@subdomain.example.co.uk dan subdomain.example.co.uk
not-an-email NULL NULL

Rule of thumb. Extraction queries (domain from email, slug from URL, prefix from ticker symbol) are where regex pays for itself. Pure validation queries (is this a valid email shape?) are border cases — email LIKE '%@%.%' is 90 % as good as a strict regex and is supported on every dialect, including SQL Server.

Worked example — back-references for finding duplicate words in a comment column

Detailed explanation. Back-references — \1, \2 inside the same regex — let you match repeated content. They are not supported in BigQuery's RE2 engine, but Postgres, MySQL, and Snowflake all have them. A common cleaning task is "find every row whose comment column contains the same word twice in a row" ("the the", "on on", "is is") — a classic copy-paste typo that's easy to spot with back-references and very awkward to spot without.

Question. Write a Postgres query that returns every row whose comment contains a doubled word (case-insensitive), and the duplicated word itself. Then explain why BigQuery cannot run the same query and what the workaround looks like.

Input (sample rows).

id comment
1 The the user reported a bug.
2 Please please double check.
3 The user reported a bug.
4 I I think we need to investigate.
5 Working as expected.

Code.

-- Postgres — \1 back-reference + (?i) inline case flag
SELECT
  id,
  comment,
  (regexp_match(comment, '(?i)\m(\w+)\s+\1\m'))[1] AS doubled_word
FROM notes
WHERE comment ~* '\m(\w+)\s+\1\m';

-- Snowflake — same pattern, double-escape backslashes for the SQL string
SELECT
  id,
  comment,
  REGEXP_SUBSTR(comment, '(?i)\\b(\\w+)\\s+\\1\\b', 1, 1, 'e', 1) AS doubled_word
FROM notes
WHERE REGEXP_LIKE(comment, '(?i).*\\b(\\w+)\\s+\\1\\b.*');

-- BigQuery — RE2 engine does NOT support \1 back-references.
-- Workaround: enumerate doubled-word candidates in code, or use SPLIT + zip-pair logic.
WITH tokens AS (
  SELECT id, comment, w, pos
  FROM notes,
       UNNEST(REGEXP_EXTRACT_ALL(LOWER(comment), r'\w+')) AS w WITH OFFSET pos
)
SELECT a.id, n.comment, a.w AS doubled_word
FROM tokens a
JOIN tokens b ON a.id = b.id AND b.pos = a.pos + 1 AND a.w = b.w
JOIN notes n ON n.id = a.id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pattern \m(\w+)\s+\1\m\m is a word-start anchor (Postgres); (\w+) captures one or more word characters; \s+ matches whitespace; \1 matches the same text the first group captured; \m ends at a word boundary.
  2. (?i) inline flag — makes the match case-insensitive. "The the" and "THE the" both match.
  3. Row 1 ("The the user…")(\w+) captures "The", \s+ matches the space, \1 then needs "The" again (case-insensitive) — finds "the". Match.
  4. Row 4 ("I I think…")(\w+) captures "I", \s+ matches the space, \1 matches the next "I". Match.
  5. BigQuery cannot run this because RE2 explicitly forbids back-references (they make worst-case matching exponential — RE2 is built for linear-time guarantees). The workaround tokenises the string, joins each token to its neighbour at offset +1, and filters where the tokens are equal.
  6. Engine difference matters in interviews — saying "RE2 doesn't support back-references" is the senior signal. Don't promise back-references on BigQuery; promise a tokenise-and-join pattern instead.

Output.

id comment doubled_word
1 The the user reported a bug. the
2 Please please double check. please
4 I I think we need to investigate. i

Rule of thumb. Reach for back-references whenever the match shape depends on "the same thing appearing twice" — duplicate words, repeated digits, repeated tags. If you're targeting BigQuery, tokenise into a table and join; on Postgres / MySQL / Snowflake the regex form is one line.

SQL interview question on validating PII patterns

A senior probe sounds like: "Compliance flagged that one of our analytics tables may contain SSNs in a free-text comment column. Write a query that lists every row whose comment contains a string that looks like a US SSN (XXX-XX-XXXX, where X is a digit), and the matched substring." The interviewer is checking three things: character class ([0-9] vs \d), quantifier ({N}), and capture / extract semantics.

Solution Using REGEXP_LIKE + REGEXP_SUBSTR (Snowflake) and the equivalent Postgres pattern

-- Snowflake
SELECT
  id,
  comment,
  REGEXP_SUBSTR(comment, '\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b') AS ssn_match
FROM customer_notes
WHERE REGEXP_LIKE(comment, '.*\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b.*');

-- Postgres (operator + function form)
SELECT
  id,
  comment,
  (regexp_match(comment, '\m[0-9]{3}-[0-9]{2}-[0-9]{4}\M'))[1] AS ssn_match
FROM customer_notes
WHERE comment ~ '\m[0-9]{3}-[0-9]{2}-[0-9]{4}\M';
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

comment regex find match? ssn_match
Customer wants refund on order 123-45-6789 123-45-6789 yes 123-45-6789
Tracking number 999-12345-67 shape doesn't fit \d{3}-\d{2}-\d{4} no NULL
Address is 100-12-3456 Main St 100-12-3456 yes 100-12-3456
Phone (555) 123-4567 no SSN-shaped match no NULL
ID 12-345-6789 hyphens but wrong digit counts no NULL

Output:

id comment (snippet) ssn_match
4711 Customer wants refund on order 123-45-6789 123-45-6789
4719 Address is 100-12-3456 Main St 100-12-3456
4733 Old record: 555-44-3322 — please redact 555-44-3322
4760 Backup file 999-88-7766 attached 999-88-7766

Why this works — concept by concept:

  • Character class + quantifier[0-9]{3}-[0-9]{2}-[0-9]{4} is impossible to express in plain LIKE. Regex is the right layer.
  • Word boundaries (\b / \m / \M) — protect against false positives like XX-123-45-6789-XX where the digits are part of a longer code. Snowflake uses \b; Postgres uses \m (start) and \M (end).
  • REGEXP_LIKE for the filter, REGEXP_SUBSTR for the extract — paired like this you can both flag the row and return the offending substring for the compliance report.
  • Backslash escaping in SQL strings'\\b' in Snowflake doubles the backslash because SQL string literals consume one level of escaping. Postgres' ~ operator and regexp_match accept single-backslash escapes in single-quoted patterns ('\m').
  • Cost — full scan of the comment column; for a 50 M row table this is typically a 30–60 s scheduled job, not a sub-second user query. For the latter, route compliance scans to a dedicated index or an external text engine.

SQL
Topic — pattern matching
REGEXP + capture group problems (SQL)

Practice →


5. Performance — trigram indexes, GIN, and leading wildcards

sql pattern matching becomes a system-design conversation the moment "make LIKE '%abc%' fast" is asked

The mental model in one line: a B-tree index on a string column accelerates LIKE 'prefix%' (range scan), accelerates nothing for LIKE '%suffix' or LIKE '%substring%', and that gap is what trigram indexes, FULLTEXT, search-optimisation services, and external search engines exist to close. Senior interviews on sql pattern matching spend most of their time here.

Visual diagram of a Postgres trigram index — top a string 'cathedral' broken into trigrams ' ca, cat, ath, the, hed, edr, dra, ral, al '; below a GIN-style inverted index mapping each trigram to a list of row ids; a query LIKE '%edr%' showing fast lookup; on a light PipeCode card.

Why a B-tree fails on leading wildcards.

  • A B-tree index is a sorted structure keyed by the column's prefix. The planner can find every row where col >= 'abc' and col < 'abd' in O(log N + matches) — that's the index range scan that powers LIKE 'abc%'.
  • For LIKE '%abc%', there is no usable prefix — the matching rows are scattered everywhere in the sort order. The planner falls back to a sequential scan of every row.

Postgres — pg_trgm + GIN is the senior answer.

  • pg_trgm is a Postgres extension that breaks each string into trigrams — overlapping three-character windows. 'cathedral' becomes ' c', ' ca', 'cat', 'ath', 'the', 'hed', 'edr', 'dra', 'ral', 'al ', 'l '.
  • A GIN (Generalized Inverted iNdex) index stores every trigram once, mapped to the list of row ids that contain it.
  • For LIKE '%edr%', the planner extracts the trigram 'edr' from the pattern, looks it up in the GIN index, and returns the candidate row ids — typically a tiny fraction of the table.
  • For ILIKE '%abc%', the same shape works because the trigrams are stored case-insensitively when the index uses gin_trgm_ops (Postgres normalises to lowercase internally).
  • Trigger threshold — patterns shorter than 3 characters cannot use the trigram index because there are no full trigrams to look up. LIKE '%ab%' either falls back to a sequential scan or matches every row that contains the trigrams '_ab' and 'ab_' recombined.

MySQL — FULLTEXT for whole-word search.

  • CREATE FULLTEXT INDEX idx_notes_body ON notes(body) builds an inverted index of whole words, not trigrams.
  • Search shape: WHERE MATCH(body) AGAINST('timeout' IN NATURAL LANGUAGE MODE) — different from LIKE, returns rows ranked by relevance.
  • IN BOOLEAN MODE supports +required, -excluded, *wildcard semantics.
  • Does not accelerate LIKE '%abc%' directly. The LIKE query stays sequential; you have to rewrite the application to use MATCH AGAINST.

Snowflake — search optimisation service.

  • ALTER TABLE notes ADD SEARCH OPTIMIZATION ON EQUALITY(body), SUBSTRING(body); enables the service.
  • Internally Snowflake builds a hidden index that accelerates =, LIKE, ILIKE, RLIKE, REGEXP_LIKE, and SEARCH() predicates.
  • Cost model: storage + compute uplift; only worth it on tables with 10⁸+ rows and frequent point / contains queries.

BigQuery — the SEARCH function and search indexes.

  • CREATE SEARCH INDEX idx_notes_body ON notes(body) builds a token-based index.
  • Query: WHERE SEARCH(body, 'timeout') — tokenises like FULLTEXT.
  • Does not directly accelerate LIKE '%abc%'; you rewrite to SEARCH() to get the index path.

External search — Elasticsearch / OpenSearch.

  • For petabyte-scale text search, dedicated faceting, geo / fuzzy / ranking — none of which the warehouse models well — push the column to ES via a CDC pipeline (Debezium → Kafka → ES sink).
  • The warehouse keeps the system of record; ES serves search queries with millisecond p95.

The decision tree.

  • Pattern is anchored ('abc%') → B-tree index on the column. Free.
  • Pattern is unanchored, table is small (< 1 M rows) → just run the sequential LIKE '%abc%'. Operationally fine.
  • Pattern is unanchored, table is medium (1 M – 100 M rows), Postgrespg_trgm + GIN. The mainline answer.
  • Pattern is unanchored, table is medium, MySQL / Snowflake / BigQuery → FULLTEXT / search optimisation / SEARCH index respectively, with a SQL rewrite to use the right syntax.
  • Pattern is unanchored, table is large (100 M+), or queries need ranking / faceting → push to Elasticsearch / OpenSearch. The warehouse is not the right tool.

Worked example — build a trigram-backed LIKE '%abc%' on Postgres

Detailed explanation. A notes.body column has 30 million rows. The query WHERE body ILIKE '%timeout%' runs in 42 seconds today (full scan). After installing pg_trgm and creating a GIN index, the same query runs in 18 milliseconds. The mechanic worth understanding is the trigram extraction → GIN lookup → bitmap scan pipeline.

Question. Show the full set-up: create the extension, build the GIN index, run the slow query before and the fast query after, then explain the planner's choices with EXPLAIN ANALYZE.

Input (table shape).

column type rows
id bigint 30,000,000
created_at timestamptz
body text avg 280 chars

Code.

-- 1. One-time setup
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_notes_body_trgm
  ON notes USING GIN (body gin_trgm_ops);

-- 2. The query (unchanged shape)
SELECT id, body
FROM notes
WHERE body ILIKE '%timeout%';

-- 3. Inspect the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, body FROM notes WHERE body ILIKE '%timeout%';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. CREATE EXTENSION pg_trgm — loads the trigram operators (%, <%, %>) and the GIN opclasses (gin_trgm_ops).
  2. CREATE INDEX … USING GIN (body gin_trgm_ops) — builds an inverted index. Internally, for each row Postgres extracts every trigram of the body column and adds (trigram → row_id) pairs to the index. Build time on 30 M rows is typically 20–40 minutes; disk size ≈ 2× the table.
  3. body ILIKE '%timeout%' — the planner now considers the GIN index. It extracts the trigrams from the pattern ('tim', 'ime', 'meo', 'eou', 'out') and intersects the row-id lists for each.
  4. Bitmap index scan + recheck — the planner emits a bitmap of candidate row ids, then rechecks the predicate against the actual row (because some trigram matches may be false positives at the pattern level). Typical candidate count is 0.01 % of the table.
  5. Result — the same query plan that used to be a Seq Scan is now a Bitmap Heap Scan on idx_notes_body_trgm plus a recheck for body ~~* '%timeout%'.

Output (before / after).

Phase Plan Buffers Wall time (p50)
Before Seq Scan on notes 580,221 shared hit 42.0 s
After Bitmap Index Scan + Recheck on idx_notes_body_trgm 380 shared hit 18 ms

Rule of thumb. Trigram + GIN is the single biggest unlock for "contains-search" workloads on Postgres. The trade-off is index size (≈ 2× table) and slower writes (every insert touches the index). For columns that are append-mostly and read-heavy, the trade is overwhelmingly worth it.

Worked example — MySQL MATCH AGAINST boolean search on a comments column

Detailed explanation. MySQL's FULLTEXT index is the equivalent of Postgres' trigram GIN for whole-word text search, but the API is different: you don't LIKE '%word%' against a FULLTEXT-indexed column — you use MATCH(col) AGAINST('word'). Forgetting to rewrite the query is the most common reason a FULLTEXT index "doesn't help."

Question. A forum_posts.body MySQL column has a FULLTEXT index. The existing query WHERE body LIKE '%refactor%' is still slow. Rewrite it three ways: natural-language mode, boolean mode with +required and -excluded, and a phrase search. Then explain why each one uses the index and why the original LIKE did not.

Input (sample rows).

id body
1 We need to refactor the orders module.
2 The refactoring went well; tests pass.
3 I'm not a fan of refactor for refactor's sake.
4 Please review the PR before merging.
5 Refactoring duplicates, but keeping the auth flow.

Code.

-- 0. The FULLTEXT index (already present in this scenario)
ALTER TABLE forum_posts ADD FULLTEXT INDEX idx_body (body);

-- 1. Natural-language mode (default) — relevance-ranked
SELECT id, body
FROM forum_posts
WHERE MATCH(body) AGAINST('refactor');

-- 2. Boolean mode — required `refactor`, excluded `auth`
SELECT id, body
FROM forum_posts
WHERE MATCH(body) AGAINST('+refactor -auth' IN BOOLEAN MODE);

-- 3. Phrase search — double-quoted exact phrase
SELECT id, body
FROM forum_posts
WHERE MATCH(body) AGAINST('"refactor the"' IN BOOLEAN MODE);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The original LIKE '%refactor%' doesn't use the FULLTEXT index at all — FULLTEXT is only consulted when the query uses MATCH(col) AGAINST(...). Rewriting is mandatory.
  2. Natural-language mode — stems refactor into a search token; rows 1, 2, 3 match (all contain refactor or refactoring). Row 5 (Refactoring) matches because MySQL's FULLTEXT stems word endings.
  3. Boolean mode with +refactor -auth+ makes refactor mandatory; -auth excludes any row whose body contains auth. Row 5 is excluded because of auth flow.
  4. Phrase search — double-quoted "refactor the" requires that exact two-word sequence. Only row 1 matches; row 3 (refactor for refactor's sake) does not because the phrase shape is different.
  5. Why FULLTEXT but not LIKE — FULLTEXT tokenises the body once at insert time, building an inverted index of (token → row_id). MATCH AGAINST consults that index. LIKE is a per-row substring matcher that ignores the index entirely.

Output.

Query Matches Notes
natural-language refactor 1, 2, 3, 5 stems to refactor*
boolean +refactor -auth 1, 2, 3 row 5 excluded by auth
phrase "refactor the" 1 exact phrase

Rule of thumb. Adding a FULLTEXT index without rewriting the queries to MATCH AGAINST gives you index storage cost with zero query benefit. Always pair the schema change with the query rewrite, and verify with EXPLAIN that the plan uses fulltext and not ALL (sequential scan).

Worked example — Postgres tsvector + tsquery for natural-language search

Detailed explanation. pg_trgm is great for substring-containment search; tsvector is great for natural-language search — stemming, stop-word removal, ranking. They serve different jobs. A common interview probe is: "Which one do you reach for, and how do you wire it up?"

Question. A blog_posts table needs natural-language search across title and body with relevance ranking. Show the schema change, the GIN index, and the search query with ranking.

Input (sample rows).

id title body
1 Refactoring monoliths We refactored the orders service this quarter.
2 The art of refactoring Refactor your code before you scale your team.
3 Auth flow migration Moving from sessions to JWT was a multi-week refactor.
4 Holiday party recap The whole team turned up.

Code.

-- Schema change — add a generated tsvector column and a GIN index
ALTER TABLE blog_posts
ADD COLUMN search_doc tsvector
GENERATED ALWAYS AS (
  setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
  setweight(to_tsvector('english', coalesce(body,  '')), 'B')
) STORED;

CREATE INDEX idx_blog_posts_search_doc
  ON blog_posts USING GIN (search_doc);

-- Query — natural language with ranking
SELECT
  id,
  title,
  ts_rank(search_doc, to_tsquery('english', 'refactor & code')) AS score
FROM blog_posts
WHERE search_doc @@ to_tsquery('english', 'refactor & code')
ORDER BY score DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Generated tsvector column — Postgres computes a normalised, stemmed, stop-word-stripped representation of title + body on every insert / update. setweight tags title tokens with weight A (highest) and body tokens with weight B.
  2. GIN index on search_doc — turns the inverted-index lookups into millisecond bitmap scans.
  3. to_tsquery('english', 'refactor & code') — parses the user input into a tsquery with English stemming. The & is the boolean AND operator; | would be OR, ! would be NOT.
  4. search_doc @@ tsquery — the match operator. Returns true if every required token is present.
  5. ts_rank(search_doc, tsquery) — assigns a relevance score combining token frequency and the weights from step 1. Row 2 ranks above row 1 because both refactor and code are present.

Output.

id title score
2 The art of refactoring 0.41
1 Refactoring monoliths 0.18

Rule of thumb. Reach for tsvector + tsquery when the user is writing a natural-language search box ("refactor code", "auth migration") and expects ranking. Reach for pg_trgm when the query is a programmatic LIKE '%substring%'. They're complementary, not competitive.

SQL interview question on choosing between trigram, FULLTEXT, and external search

A senior probe: "We have 100 million customer support tickets in a Postgres column. Analysts run LIKE '%refund%', LIKE '%timeout%', LIKE '%fraud%' style queries dozens of times per day, plus ad-hoc free-form searches. Walk me through how you'd architect the search layer." The right answer compares three options on five axes.

Solution Using a tiered architecture — trigram first, external ES second

-- Tier 1 — Postgres pg_trgm + GIN on the canonical column
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_tickets_body_trgm
  ON tickets USING GIN (body gin_trgm_ops);

-- Tier 2 — When queries grow beyond contains-search (ranking, fuzzy, faceting),
--          stream tickets to Elasticsearch via Debezium → Kafka → ES sink.
--          Keep Postgres as the system of record; ES serves search-only.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage What it solves What it doesn't
pg_trgm GIN on body LIKE '%refund%', ILIKE '%TimeOut%', regex with literal substrings Boolean operators, ranking by relevance, fuzzy / typo-tolerant search
Add tsvector + tsquery + GIN natural-language word search inside Postgres (@@) Multi-language analyzers, complex aggregations on facets
Add Debezium → Kafka → ES ranked search, facets, geo, multi-language analyzers, p95 milliseconds New cluster to operate; eventual consistency window

Output:

Query shape Latency target Right tier
LIKE 'PREFIX-%' < 10 ms B-tree on column (no change)
ILIKE '%refund%' < 50 ms tier 1 — pg_trgm GIN
ILIKE '%refnd%' typo-tolerant < 100 ms tier 3 — Elasticsearch with fuzzy match
Boolean: +refund -duplicate +urgent < 200 ms tier 2 or 3 — tsquery or ES
Facet count by status, agent, category for "refund" < 500 ms tier 3 — ES with aggregations

Why this works — concept by concept:

  • Trigram index as the 80 % answer — for plain contains-search, pg_trgm covers most analyst queries with one extension and one index. No new system to operate.
  • tsvector for natural-language search — when you need stemming (run, running, runs all match), stop-word handling, and ranking, Postgres' built-in full-text search is the middle tier without leaving the database.
  • External ES for the long tail — once you need fuzzy matching, multi-language analyzers, geo queries, or rich facets, Elasticsearch / OpenSearch becomes the right tool. CDC keeps Postgres as the source of truth.
  • Operational reality — you don't deploy ES on day one for a feature that fits inside pg_trgm. You upgrade tiers when query shape demands it, not before.
  • Cost — pg_trgm = +2× column storage, +20–40 % write amplification; tsvector = +30–80 % column storage; ES cluster = a separate infra footprint with its own ops budget.

SQL
Topic — indexing
Indexing problems (B-tree, GIN, trigram)

Practice →


Pattern-matching recipes (cheat sheet)

  • Starts with. col LIKE 'prefix%' — anchored prefix, index range scan, the fastest shape.
  • Ends with. col LIKE '%suffix' — full scan; fix with a reversed-string functional index for one-off suffix searches, or trigram for general unanchored.
  • Contains (small dataset). col LIKE '%substring%' — full scan is fine if the table is under ~1 M rows or the query is rare.
  • Contains (big dataset). Postgres → pg_trgm + GIN. MySQL → FULLTEXT MATCH AGAINST. Snowflake → search optimisation service. BigQuery → search index + SEARCH().
  • Case-insensitive contains. Postgres / Snowflake → col ILIKE '%abc%'. MySQL → LIKE (default CI). SQL Server → LIKE '%abc%' COLLATE …_CI_AS. BigQuery → REGEXP_CONTAINS(col, r'(?i)abc'). Portable → LOWER(col) LIKE LOWER('%abc%') (slow, kills indexes).
  • Email validate (quick). col LIKE '%@%.%' — 90 % of the strictness for 10 % of the work, supported everywhere.
  • Email validate (strict). col ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' — character class + quantifier; needs regex.
  • Extract domain. (regexp_match(email, '^[^@]+@(.+)$'))[1] on Postgres; REGEXP_EXTRACT(email, r'@(.+)$') on BigQuery.
  • Exactly N characters. col LIKE '___' (three of anything); or col ~ '^.{3}$' if you want to be explicit.
  • Escape a literal %. col LIKE '50\%' ESCAPE '\' — match the four-character string 50%.
  • Multi-pattern OR. Postgres → col ~ 'foo|bar|baz' (regex alternation) or chained OR col LIKE. Snowflake → LIKE_ANY('%foo%', '%bar%').
  • Negation. col NOT LIKE '%spam%' or col !~ 'spam' — same scan cost as the positive shape.
  • Anchored regex on indexed column. Postgres can use pg_trgm to accelerate col ~ 'abc' if 'abc' is a literal substring of the pattern. Useful for partial regex queries.

Frequently asked questions

LIKE vs REGEXP — when should I use each?

Reach for LIKE whenever the pattern is expressible with just % and _ wildcards — starts-with, ends-with, contains, fixed-length-shaped strings. It's simpler, faster (when anchored, the B-tree index helps), and supported on every SQL dialect including SQL Server. Switch to REGEXP only when you need character classes ([A-Z]), quantifiers ({3,5}), alternation ((cat|dog)), capture groups, or back-references — things LIKE literally cannot express. A useful test: if you can write the pattern with only %, _, and literal characters, stay on LIKE.

Why is my LIKE '%abc%' query slow?

Because the leading % means the database cannot use a B-tree index on the column — it has to read every row and run the matcher against it. On a 100-row dev table this is invisible; on a 10-million-row prod table it's a 30–60 second sequential scan. The three production fixes are: (1) anchor the pattern to the start if possible (LIKE 'abc%' is a range scan, blazingly fast); (2) install a trigram index on Postgres (CREATE INDEX … USING GIN (col gin_trgm_ops)); (3) move text search to a dedicated index — FULLTEXT on MySQL, search optimisation on Snowflake, SEARCH index on BigQuery, or external Elasticsearch for petabyte-scale.

Is LIKE case-sensitive?

Depends on the dialect and the column collation. Postgres, Snowflake, and BigQuery treat LIKE as case-sensitive by default — use ILIKE (Postgres / Snowflake) or REGEXP_CONTAINS(col, r'(?i)…') (BigQuery) for case-insensitive matches. MySQL treats LIKE as case-insensitive by default because the standard collation utf8mb4_0900_ai_ci is case-insensitive — switch to a _bin or _cs collation if you need case-sensitivity. SQL Server defers to the column's collation entirely, and you can override per-query with COLLATE Latin1_General_CI_AS (case-insensitive) or Latin1_General_CS_AS (case-sensitive).

How do I escape % and _ in a LIKE query?

Use the ESCAPE clause to designate any single character as the escape: WHERE col LIKE '50\%' ESCAPE '\' matches the literal four-character string 50%. The escape character can be anything not appearing in your data — !, #, | are common choices when \ is already used (e.g. file paths). Without the ESCAPE clause, \% is just two literal characters \ and %-as-wildcard, which usually isn't what you want. This works the same on Postgres, MySQL, SQL Server, Snowflake, and BigQuery — it's part of the SQL standard.

Does REGEXP use an index?

In general no — a regex predicate forces a full sequential scan because the planner cannot prove the pattern matches a contiguous range in a sorted B-tree. The big exception is Postgres with pg_trgm + GIN: if the regex contains a substantial literal substring (≥ 3 characters), the planner extracts the trigrams from the pattern and uses the GIN index to narrow the candidate set, then rechecks each candidate against the full regex. MySQL's FULLTEXT and BigQuery's SEARCH are word-based, not regex-aware, so they don't accelerate REGEXP directly. The senior answer: pair regex with a trigram index on Postgres, or rewrite to LIKE 'literal-prefix%' AND col ~ 'rest-of-pattern' so the planner can use a B-tree for the anchored part.

Is LIKE '%abc%' with a leading wildcard ever production-safe?

Yes, in two scenarios. First, when the table is small — under roughly 1 million rows on a warehouse, under 100 thousand on a transactional database — the full scan finishes fast enough that nobody cares. Second, when the column has a trigram index (pg_trgm on Postgres, FULLTEXT on MySQL, search optimisation on Snowflake) the leading wildcard is no longer a problem because the index handles the unanchored case directly. The danger zone is "table grew without anyone noticing" — a query that ran in 200 ms at 100 K rows takes 60 seconds at 30 M, and the leading wildcard is almost always the culprit. Run EXPLAIN ANALYZE on every LIKE '%…%' query in your nightly job set, and add trigram indexes anywhere the scan dominates.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every pattern-matching shape above ships with hands-on practice rooms where you write the SQL, run it against real data, and watch the planner choose the right index. Start with the pattern-matching library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice pattern matching now →
String function drills →

Top comments (0)