Every production data pipeline eventually grows a layer of sql query with wildcard logic — email validation, log line filtering, name normalisation, URL bucket extraction, fuzzy joins between vendor feeds. The shape is always the same: a string column, a pattern, and a WHERE clause that decides which rows survive. Interviewers love this surface because the right answer changes with dataset size, dialect, and where the leading wildcard sits — three knobs that separate a junior **like in sql** answer from a senior sql pattern matching plan that holds at scale.
This guide walks the three pattern-matching layers you will reach for in real data engineering work: **sql like operator** with % and _ for plain prefix and suffix matches, ILIKE plus collation tricks for case-insensitive search across five dialects, and **sql regexp** with capture groups for structured extraction. We then close with the operational story interviewers actually care about — why LIKE '%abc%' becomes the #1 prod offender at 10 million rows, how Postgres' pg_trgm + GIN index turns it back into a bitmap lookup, and where MySQL's FULLTEXT, Snowflake's search-optimization service, and an external Elasticsearch / OpenSearch tier each take over. Every section ships with a worked teaching example and a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why the chosen pattern beats the naive one. By the end you'll have a reusable mental model for sql wildcards that anchors patterns to the start whenever possible, escapes literal % cleanly, and reaches for trigram indexes before the team meeting where someone proposes pulling Elasticsearch into the stack.
When you want hands-on reps immediately after reading, drill the pattern-matching practice library →, pair it with string function problems →, and pressure-test it with filtering drills →.
On this page
- Why pattern matching shows up in every prod query
- LIKE basics — % and _ wildcards plus escape characters
- ILIKE and case-insensitive matching across dialects
- REGEXP, SIMILAR TO, and regex functions across dialects
- Performance — trigram indexes, GIN, and leading wildcards
- Pattern-matching recipes (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why pattern matching shows up in every prod query
sql query with wildcard lives at the intersection of data cleaning, search, and analytics
The one-sentence invariant: every dataset eventually grows a string column where the only way to filter, validate, or extract a slice is to match against a pattern — and the pattern's shape determines whether your query runs in milliseconds or scans the whole table. Once you internalise the three layers — LIKE for plain % / _ wildcards, ILIKE for case-insensitive convenience, REGEXP for structured capture — the rest of the sql wildcards discussion is a sequence of trade-offs: simplicity vs power, anchored vs unanchored, indexed vs full scan.
Where pattern matching shows up in real DE work.
-
Email validation.
WHERE email LIKE '%@%.%'is the quick-and-dirty gate; the regex^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$is the rigorous one. -
URL bucketing.
WHERE url LIKE '/checkout/%'to bucket pages into funnels, or a regex to peel out the campaign id from query strings. -
Log line filtering.
WHERE message LIKE '%ERROR%'for a quick triage; trigram-backedWHERE message ILIKE '%timeout%'once the table grows. -
Name normalisation.
WHERE last_name ILIKE 'mc%'to catchMcKinsey,McKinney, etc., regardless of capitalisation. - Fuzzy joins. Vendor A writes "Apple Inc.", vendor B writes "Apple, Inc" — a normalised LIKE or a trigram similarity score reconciles them.
-
Compliance scans.
WHERE col REGEXP '^[0-9]{3}-[0-9]{2}-[0-9]{4}$'to detect SSN-shaped strings in a column that should not contain PII.
The three layers in one sentence each.
-
LIKEandILIKE— string operators with two wildcards (%for zero-or-more,_for exactly-one) and an optional escape clause. Cheap to write, supported everywhere, fast when the pattern is anchored at the start. -
REGEXP/~/REGEXP_LIKE— full regular-expression matching with character classes, quantifiers, alternation, and capture groups. Vastly more expressive; always scans unless paired with a specialised index. -
Search / FULLTEXT / trigram indexes —
pg_trgm+ GIN on Postgres,FULLTEXT MATCH AGAINSTon MySQL, search optimisation service on Snowflake, theSEARCHfunction on BigQuery, and external Elasticsearch / OpenSearch for the truly large case.
Why "naive LIKE" is the #1 prod performance offender.
- A pattern that starts with
%(e.g.'%timeout%') cannot use a B-tree index on the column — the database has to read every row and run the matcher. - A pattern that starts with a literal prefix (e.g.
'/checkout/%') is a range scan: the database walks the index from'/checkout/'to'/checkout0'(the next byte) and only reads those rows. - The 100-row dev table hides this completely; the 10-million-row prod table makes it the slowest query in the warehouse.
- The fix is one of three things: (a) anchor the pattern to the start, (b) install a trigram or FULLTEXT index, or (c) move text search to a dedicated engine.
What interviewers listen for.
- Do you immediately distinguish anchored (
'abc%') from unanchored ('%abc%') patterns when performance comes up? — senior signal. - Do you reach for
ILIKEon Postgres / Snowflake andCOLLATEon SQL Server when case-insensitivity is asked? — dialect literacy. - Do you mention
pg_trgm+ GIN when the interviewer asks "how do you makeLIKE '%foo%'fast?" — production-grade answer. - Do you reach for REGEXP capture groups when the question is "extract the domain from the email"? — required answer.
Worked example — pick the right layer for five real queries
Detailed explanation. Most pattern-matching choices are obvious once you put the query into one of three buckets — exact / prefix / suffix (LIKE), case-insensitive substring (ILIKE or LOWER + LIKE), structured extraction or validation (REGEXP). Picking the wrong layer is the most common interview slip: people reach for regex when LIKE would do, or for LIKE when they need a regex.
Question. For each query below, pick the correct layer (LIKE / ILIKE / REGEXP) and justify it in one sentence:
- Find all rows where
urlstarts with/checkout/. - Find all rows where
namecontainssmithregardless of capitalisation. - Find all rows where
emailmatches a strict RFC-ish format. - Find all rows where
codeis exactly three uppercase letters. - Extract the domain portion of
emailinto a new column.
Input (sample rows).
| id | url | name | code | |
|---|---|---|---|---|
| 1 | /checkout/start | Smith, Alice | alice@acme.io | ABC |
| 2 | /cart | smith Bob | bob@@acme.io | A1C |
| 3 | /checkout/pay | Smithers | not-an-email | abc |
| 4 | /home | Doe | carol@beta.co | XYZ |
| 5 | /checkout/done | SMITH, Dan | dan@gamma.net | ABCD |
Code.
-- 1. starts-with → LIKE (anchored)
SELECT id FROM events WHERE url LIKE '/checkout/%';
-- 2. case-insensitive contains → ILIKE (Postgres / Snowflake)
SELECT id FROM events WHERE name ILIKE '%smith%';
-- 3. strict format → REGEXP (Postgres ~ operator)
SELECT id FROM events
WHERE email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$';
-- 4. exactly 3 uppercase letters → REGEXP (LIKE cannot express character class)
SELECT id FROM events WHERE code ~ '^[A-Z]{3}$';
-- 5. extract domain → REGEXP capture group
SELECT id, (regexp_match(email, '^[^@]+@(.+)$'))[1] AS domain FROM events;
Step-by-step explanation.
-
url LIKE '/checkout/%'— anchored prefix. The database can use a B-tree index onurland do a range scan from/checkout/to/checkout0. Rows 1, 3, 5 match. -
name ILIKE '%smith%'— leading wildcard kills the B-tree onname. The pattern is case-insensitive thanks toILIKE. Rows 1, 2, 3, 5 match (Smith, smith, Smithers, SMITH). -
email ~ '^...$'— only LIKE can't express "one-or-more of a character class." Rows 1, 4, 5 match; row 2 fails on@@; row 3 has no@. -
code ~ '^[A-Z]{3}$'— LIKE could express "three of any character" with'___', but not "three uppercase letters." Row 1 (ABC) matches; row 2 has a digit; row 3 is lowercase; row 5 has 4 chars. -
regexp_match(email, '^[^@]+@(.+)$')[1]— the capture group(.+)after the@extracts the domain into a new column.acme.io,beta.co,gamma.netfor the valid rows.
Output.
| Query | Layer | Matches | Why |
|---|---|---|---|
| 1 — url starts with /checkout/ | LIKE | 1, 3, 5 | anchored prefix, index range scan |
| 2 — name contains "smith" | ILIKE | 1, 2, 3, 5 | case-insensitive convenience |
| 3 — strict email format | REGEXP | 1, 4, 5 | needs character class + quantifier |
| 4 — exactly 3 uppercase | REGEXP | 1 | LIKE can't constrain to uppercase |
| 5 — extract domain | REGEXP capture | 1, 4, 5 | needs (.+) capture group |
Rule of thumb. Reach for LIKE first; bump to ILIKE when you need case-insensitivity on Postgres / Snowflake; only reach for REGEXP when you need character classes, quantifiers, alternation, or capture groups. Don't pay the regex tax for a query that LIKE would handle in five fewer characters.
Worked example — anatomy of a failed shortcut from LIKE to REGEXP
Detailed explanation. A common code-review smell is a regex where a LIKE would have done. The cost of the wrong choice is twofold: harder-to-read SQL and a planner that can no longer use a B-tree index. Walking through the rewrite both ways teaches the trade-off concretely.
Question. A junior engineer wrote WHERE order_id ~ '^ORD-[0-9]+$' to find every row whose order_id follows the format ORD-{digits}. Rewrite this two ways: (a) using LIKE plus an explicit length / digit check, and (b) using a stricter regex that pins the digit count. Compare the index usage of all three forms.
Input (sample rows).
| id | order_id |
|---|---|
| 1 | ORD-1001 |
| 2 | ORD-2002 |
| 3 | ord-3003 |
| 4 | ORDER-4004 |
| 5 | ORD-A2B |
| 6 | ORD- |
Code.
-- Original (regex, full scan)
SELECT id FROM orders WHERE order_id ~ '^ORD-[0-9]+$';
-- Rewrite (a) — LIKE with anchored prefix; relies on a separate check for "all digits after prefix"
SELECT id FROM orders
WHERE order_id LIKE 'ORD-%'
AND order_id NOT LIKE 'ORD-%[^0-9]%'; -- MySQL / SQL Server character class
-- Rewrite (b) — regex pinned to exactly N digits (4-8 in this catalogue)
SELECT id FROM orders WHERE order_id ~ '^ORD-[0-9]{4,8}$';
Step-by-step explanation.
-
Original regex — anchored at both ends, but the planner still does a full scan unless
pg_trgmis installed. The anchor^ORD-could be a B-tree range scan, but the regex predicate disables that path. -
Rewrite (a) — splits the predicate into an anchored
LIKE 'ORD-%'(B-tree-friendly) plus a content checkNOT LIKE 'ORD-%[^0-9]%'. The planner can use the B-tree onorder_idto narrow to candidates, then apply the second clause as a filter. -
Rewrite (b) — same regex shape but with
{4,8}instead of+. This rejectsORD-A2BandORD-cleanly and adds a useful length sanity check. -
Row 5 (
ORD-A2B) — fails original regex (letters), passes rewrite (a) only after theNOT LIKEcheck, fails rewrite (b). -
Row 6 (
ORD-) — has the prefix but no digits. Original regex rejects (the+requires at least one digit). Rewrite (a) rejects (the NOT LIKE check would not match — needs careful framing — depends on dialect). Rewrite (b) rejects ({4,8}requires at least 4 digits).
Output.
| Form | Index used | Matches | Notes |
|---|---|---|---|
~ '^ORD-[0-9]+$' |
none (seq scan) | 1, 2 | correct but slow on large tables |
LIKE 'ORD-%' AND NOT LIKE 'ORD-%[^0-9]%' |
B-tree range scan + filter | 1, 2 | fast on Postgres; dialect-sensitive bracket syntax |
~ '^ORD-[0-9]{4,8}$' + pg_trgm
|
GIN on ORD- literal |
1, 2 | best when trigram index is already present |
Rule of thumb. When a regex starts with a literal prefix (^ORD-), rewrite it to a LIKE 'ORD-%' predicate plus a secondary filter for the content. The B-tree range scan on the prefix is almost always the right plan on a transactional database; the secondary filter rarely costs more than the regex would have.
SQL interview question on picking the right pattern-matching layer
A common opener: "I have a WHERE email LIKE '%@gmail.com' query that runs nightly. It's slow and I can't add an index because of the leading %. How would you reason about it?" The interviewer is testing whether you can name the leading-wildcard problem, separate the correctness layer (LIKE handles it) from the performance layer (leading wildcard kills the index), and propose a fix without over-engineering.
Solution Using reverse-string trick + indexed range scan
-- Original (slow)
SELECT id FROM users WHERE email LIKE '%@gmail.com';
-- Fix: store reverse(email) and use an anchored prefix on the reversed value
ALTER TABLE users ADD COLUMN email_rev TEXT
GENERATED ALWAYS AS (reverse(email)) STORED;
CREATE INDEX idx_users_email_rev ON users (email_rev);
-- New query — anchored prefix → index range scan
SELECT id FROM users WHERE email_rev LIKE reverse('@gmail.com') || '%';
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | Add generated column email_rev = reverse(email)
|
every row gets a reversed copy of the email, maintained automatically |
| 2 | Create B-tree index on email_rev
|
lookups on email_rev use the index |
| 3 | Rewrite query as `email_rev LIKE reverse('@gmail.com') \ | \ |
| 4 | Planner picks index range scan from {% raw %}moc.liamg@ to moc.liamg{
|
reads only matching rows |
Output:
| Metric | Before (LIKE '%@gmail.com') |
After (reverse trick) |
|---|---|---|
| Plan | sequential scan | index range scan |
| Rows read | 10,000,000 | ≈ 1,200 |
| Cost | 220,000 | 18 |
| Wall time (p50) | 8.4 s | 6 ms |
Why this works — concept by concept:
-
Leading-wildcard problem —
LIKE '%suffix'can never use a B-tree on the original column because B-tree indexes are sorted by prefix. The planner has no choice but to read every row. -
Reverse-string trick — by indexing
reverse(email), the suffix'@gmail.com'becomes the prefix'moc.liamg@'of the reversed string, which the B-tree can range-scan. -
Generated column — Postgres maintains
email_revautomatically on every insert / update; no application code touches it. -
Alternative: trigram index — for general substring matching (
%foo%, not just%suffix), the right answer ispg_trgm+ GIN instead of the reverse trick. The reverse trick beats trigrams only when the pattern is strictly a suffix. - Cost — disk = O(N) extra (one reversed copy); index = O(N log N) build; query = O(log N + matches) range scan.
SQL
Topic — pattern matching
SQL pattern-matching problems (LIKE, REGEXP, wildcards)
2. LIKE basics — % and _ wildcards plus escape characters
like in sql is two wildcards, one optional escape, and a single performance rule — anchor or pay
The mental model in one line: LIKE is the SQL comparison operator that matches a string against a pattern containing exactly two wildcards — % for zero-or-more characters and _ for exactly-one character — plus an optional ESCAPE clause for matching the wildcards literally. Once you say "anchor the pattern to the start whenever possible," half of the sql like operator interview surface collapses to obvious answers.
The two wildcards.
-
%— matches zero or more characters of any kind.'abc%'matches'abc','abcd','abc123';'%abc'matches anything ending in'abc';'%abc%'matches anything containing'abc'. -
_— matches exactly one character.'a_c'matches'abc','axc','a1c'; but not'ac'(zero chars) or'abbc'(two). Combined:'a__c'matches'abbc','a12c';'abc_'matches any four-character string starting with'abc'.
The three shapes and their performance.
-
col LIKE 'prefix%'— anchored prefix. Uses a B-tree index range scan. The fastest shape; this is the one you want. -
col LIKE '%suffix'— anchored suffix. Full scan on a B-tree-indexed column. Fix: storereverse(col)and index it. -
col LIKE '%substring%'— unanchored contains. Full scan. Fix:pg_trgm+ GIN (Postgres), FULLTEXT (MySQL), search optimisation (Snowflake),SEARCHfunction (BigQuery), or external Elasticsearch.
Escape characters.
- The wildcards themselves are not literal characters. To match a literal
%or_you must escape it. - Postgres / MySQL / SQL Server / Snowflake / BigQuery all support the standard
ESCAPEclause:WHERE col LIKE '50\%' ESCAPE '\'matches the literal four-character string'50%'. - You can pick any single character as the escape:
LIKE '50!%' ESCAPE '!'is equally valid and useful when\already appears in the data.
Case sensitivity by default.
-
Postgres, Snowflake, BigQuery —
LIKEis case-sensitive. UseILIKE(Postgres / Snowflake) orLOWER(col) LIKE LOWER(pattern)(portable). -
MySQL —
LIKEis case-insensitive by default because of the defaultutf8mb4_general_cicollation. Switch to a_cs(case-sensitive) collation if you need case-sensitive matching. -
SQL Server — depends on the column's collation.
Latin1_General_CI_ASis case-insensitive;Latin1_General_CS_ASis case-sensitive. You can override per-query withCOLLATE.
Worked example — match phone numbers with a mix of _ and %
Detailed explanation. Phone numbers come in two shapes: a strict ten-digit format (555-123-4567) and a free-form one with optional country code (+1 555 123 4567). Picking the right wildcard mix is a classic like in sql interview probe — it tests whether you remember that _ is one char and % is zero-or-more.
Question. Write a LIKE pattern that matches the format XXX-XXX-XXXX exactly (no country code, no parentheses), where each X is any character. Then write a second pattern that matches the same format optionally preceded by +1 (a plus sign, the digit one, and a space).
Input (sample rows).
| id | phone |
|---|---|
| 1 | 555-123-4567 |
| 2 | +1 555-123-4567 |
| 3 | 555 123 4567 |
| 4 | (555) 123-4567 |
| 5 | 555-1234 |
| 6 | 555-123-4567-ext |
Code.
-- Pattern 1: strict XXX-XXX-XXXX
SELECT id, phone FROM contacts
WHERE phone LIKE '___-___-____';
-- Pattern 2: optional "+1 " prefix
SELECT id, phone FROM contacts
WHERE phone LIKE '___-___-____'
OR phone LIKE '+1 ___-___-____';
Step-by-step explanation.
-
Pattern 1 uses 12 underscores total —
___-___-____is exactly three underscores, a literal hyphen, three underscores, a literal hyphen, four underscores.LIKEmatches the whole string, so this pattern only matches strings of length 12 in that exact shape. -
Row 1 (
555-123-4567) — 12 chars in the right shape → matches. -
Row 2 (
+1 555-123-4567) — 15 chars → does not match pattern 1. Pattern 2 explicitly includes the+1prefix and matches. -
Row 3 (
555 123 4567) — spaces instead of hyphens → does not match either pattern. -
Row 5 (
555-1234) — only 8 chars → does not match. -
Row 6 (
555-123-4567-ext) — 16 chars, extra suffix → does not match either pattern (the patterns are anchored at both ends sinceLIKEmatches the whole string).
Output.
| id | phone | Pattern 1 | Pattern 2 |
|---|---|---|---|
| 1 | 555-123-4567 | match | match |
| 2 | +1 555-123-4567 | — | match |
| 3 | 555 123 4567 | — | — |
| 4 | (555) 123-4567 | — | — |
| 5 | 555-1234 | — | — |
| 6 | 555-123-4567-ext | — | — |
Rule of thumb. _ is the right tool when you need exactly N characters in this shape. Reach for % when you accept any number of characters here — including zero. Combining the two is how LIKE expresses constraints like "10 digits with hyphens" without dropping into regex.
Worked example — escape a literal % and _ in a marketing campaigns table
Detailed explanation. A marketing analytics table has a campaign_name column with values like "Summer Sale 50% Off", "Q3_Launch", and "Black_Friday 25%". The interviewer asks for two queries: one that finds every campaign whose name contains a literal percentage sign, and one that finds every campaign whose name starts with "Q3_" (underscore is the literal third character, not a wildcard).
Question. Write the two queries on Postgres. Then show the equivalent on MySQL where the default escape character is \. Then write a third query that finds every campaign with both a literal % and a literal _ somewhere in the name.
Input (sample rows).
| id | campaign_name |
|---|---|
| 1 | Summer Sale 50% Off |
| 2 | Q3_Launch |
| 3 | Black_Friday 25% |
| 4 | Spring Renewal |
| 5 | Year_End |
| 6 | Holiday 100% Match |
Code.
-- 1. Names containing a literal % — designate `!` as the escape char
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE '%!%%' ESCAPE '!';
-- 2. Names starting with literal "Q3_" — escape the underscore
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE 'Q3!_%' ESCAPE '!';
-- 3. Names containing both a literal % and a literal _
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE '%!%%' ESCAPE '!'
AND campaign_name LIKE '%!_%' ESCAPE '!';
-- Same on MySQL with default backslash escape (works on Postgres too)
SELECT id, campaign_name FROM campaigns
WHERE campaign_name LIKE '%\%%' ESCAPE '\\';
Step-by-step explanation.
-
LIKE '%!%%' ESCAPE '!'— theESCAPE '!'clause designates!as the escape character. The pattern reads as: any number of characters, then a literal%(because!%is the escape sequence), then any number of characters. -
LIKE 'Q3!_%' ESCAPE '!'—!_is the escape for a literal underscore. So the pattern reads asQ3followed by a literal_followed by any number of characters. Row 2 (Q3_Launch) matches; row 5 (Year_End) doesn't (wrong prefix). -
Combined query — the two
LIKEclauses are independent; both must hold. Row 3 (Black_Friday 25%) matches both (it has a_and a%). -
MySQL with backslash —
'%\%%' ESCAPE '\\'is the same pattern. The double backslash in the SQL string is the standard string-literal escape; the resulting four-character pattern is% \ % %, whichLIKEreads as%, escape-%,%. -
Why pick
!over\— when the data itself contains backslashes (file paths, JSON-encoded strings),\collides.!(or any unused character) makes the intent unambiguous.
Output.
| Query | Matches |
|---|---|
1 — literal %
|
1, 3, 6 |
2 — starts with literal Q3_
|
2 |
3 — both literal % and literal _
|
3 |
Rule of thumb. When your data may contain backslashes, pick a non-data character (!, |, #) as the ESCAPE. The ESCAPE clause works the same way on every major dialect — Postgres, MySQL, SQL Server, Snowflake, and BigQuery — so the pattern you write here ports cleanly.
SQL interview question on building an anchored search
A senior probe sounds like: "We log every page view into a url column. Write a query that returns the count of page views per top-level section, where a section is the slug between the first and second slashes — /checkout/, /cart/, /account/, etc. Solve it with LIKE only, no regex." The interviewer is checking whether you can build anchored patterns that compose into a GROUP BY.
Solution Using anchored LIKE patterns with explicit OR
SELECT
CASE
WHEN url LIKE '/checkout/%' THEN 'checkout'
WHEN url LIKE '/cart%' THEN 'cart'
WHEN url LIKE '/account/%' THEN 'account'
WHEN url LIKE '/search%' THEN 'search'
WHEN url LIKE '/blog/%' THEN 'blog'
ELSE 'other'
END AS section,
COUNT(*) AS pageviews
FROM events
WHERE event_type = 'page_view'
GROUP BY 1
ORDER BY pageviews DESC;
Step-by-step trace.
| url | which LIKE matches | section |
|---|---|---|
| /checkout/start | '/checkout/%' |
checkout |
| /checkout/done | '/checkout/%' |
checkout |
| /cart |
'/cart%' (no trailing / required) |
cart |
| /cart/items | '/cart%' |
cart |
| /account/settings | '/account/%' |
account |
| /search?q=widgets |
'/search%' (no trailing slash) |
search |
| /blog/post-1 | '/blog/%' |
blog |
| /unknown-page | none — fallback | other |
Output:
| section | pageviews |
|---|---|
| checkout | 1,820,331 |
| cart | 1,402,557 |
| blog | 880,210 |
| search | 612,884 |
| account | 488,712 |
| other | 312,401 |
Why this works — concept by concept:
-
Anchored prefix patterns — every
LIKEclause starts with a literal prefix, so the database can use a B-tree index onurland do a range scan per branch. No leading wildcards anywhere. -
CASE WHEN ordering — the first matching branch wins, which lets you put the more specific patterns earlier (e.g.
/cart/itemsmatches/cart%before any fallback). -
Trailing
/vs no/—'/cart%'(no slash) catches both/cart(the index page) and/cart/items(a sub-page).'/cart/%'(with slash) would miss/cartitself. Pick the shape that matches the URL convention. -
Bucket fallback — the
ELSE 'other'branch ensures every row is counted somewhere. Without it,GROUP BYwould silently drop rows whose URL doesn't match any pattern. -
Cost — query = O(N) full scan since we touch every row; per-row pattern match = O(L) where L is the URL length. The B-tree index helps if the planner can rewrite the
CASEinto per-branch range scans (Postgres 12+ does this for trivial CASEs; older versions may need a manualUNION ALL).
SQL
Topic — pattern matching
LIKE pattern-matching problems (SQL)
3. ILIKE and case-insensitive matching across dialects
sql like operator becomes a dialect quiz the moment "case-insensitive" is mentioned
The mental model in one line: ILIKE is the Postgres / Snowflake convenience operator that does what LIKE does, but case-insensitively; MySQL gives you case-insensitive LIKE for free via the default collation; SQL Server controls case sensitivity via column collation; BigQuery makes you write a regex with the (?i) flag. Five dialects, five answers, one production decision: use the dialect's native shape whenever possible.
The five-dialect matrix.
-
Postgres —
col ILIKE '%abc%'is a native operator that matches case-insensitively. Same semantics asLIKE, just no respect for case. Backed by~~*under the hood. -
MySQL —
col LIKE '%abc%'is already case-insensitive by default because the default collation (utf8mb4_0900_ai_ciin MySQL 8) is case-insensitive. Switch toLIKE BINARYor a_bin/_cscollation if you need case-sensitive matching. -
SQL Server —
col LIKE '%abc%'defers to the column's collation. DefaultLatin1_General_CI_ASis case-insensitive. Override per-query:col LIKE '%abc%' COLLATE Latin1_General_CI_AI. There is noILIKEkeyword. -
Snowflake —
col ILIKE '%abc%'is a native operator, same semantics as Postgres. Also supports theLIKE_ANYvariants for multi-pattern matching. -
BigQuery — no
ILIKEkeyword. UseREGEXP_CONTAINS(col, r'(?i)abc')with the inline(?i)flag, orLOWER(col) LIKE LOWER('%abc%')as a portable fallback.
Portable workaround.
-- Works on every dialect, kills any index on `col`
WHERE LOWER(col) LIKE LOWER('%abc%')
This pattern is universally understood but expensive — applying LOWER() to the column prevents any B-tree index from being used. A functional index on LOWER(col) (Postgres, Oracle, MySQL 8.0.13+) restores the index path; SQL Server uses computed columns.
Performance — ILIKE vs LIKE.
- On Postgres,
ILIKEis typically 10–20 % slower thanLIKEbecause the matcher does extra collation work per byte. The difference is small on short strings but adds up onTEXTcolumns with multi-byte UTF-8. - On Snowflake, the difference is negligible at the storage layer (micro-partitions are scanned columnarly anyway).
- On MySQL,
LIKEwith a case-insensitive collation is the fast path;LIKE BINARY(case-sensitive) is slightly faster on ASCII columns but identical on most production workloads. - On SQL Server,
COLLATEclauses insideWHEREcan disable index seeks if the chosen collation differs from the index's collation. Watch this in execution plans.
Locale and Unicode gotchas.
- Turkish dotless
ıvs dottedi, Germanßvsss, the various Greek accents — the right collation matters when international data flows in. - Postgres
ILIKEuses the database default collation; override withCOLLATE "C"(byte-wise) orCOLLATE "und-x-icu"for ICU-driven Unicode rules. - Snowflake's
ILIKEis Unicode-aware out of the box.
Worked example — same case-insensitive contains query across five dialects
Detailed explanation. A common practice question is: "Find every customer whose name contains 'smith' in any capitalisation. Write the query for Postgres, MySQL, SQL Server, Snowflake, and BigQuery." The query is trivial — what's tested is whether you know each dialect's native shape.
Question. Given a customers table with a full_name column, write five queries that return the same result — every row whose full_name contains 'smith' in any case — one per dialect. Then state the expected performance order.
Input (sample rows).
| id | full_name |
|---|---|
| 1 | Alice SMITH |
| 2 | bob smithers |
| 3 | Carol Smyth |
| 4 | Dan O'Smith |
| 5 | Eve Doe |
Code.
-- Postgres
SELECT id, full_name FROM customers
WHERE full_name ILIKE '%smith%';
-- MySQL (default case-insensitive collation — LIKE is enough)
SELECT id, full_name FROM customers
WHERE full_name LIKE '%smith%';
-- SQL Server (force CI collation in case the column is CS)
SELECT id, full_name FROM customers
WHERE full_name LIKE '%smith%' COLLATE Latin1_General_CI_AS;
-- Snowflake
SELECT id, full_name FROM customers
WHERE full_name ILIKE '%smith%';
-- BigQuery (no ILIKE; use REGEXP_CONTAINS with (?i))
SELECT id, full_name FROM customers
WHERE REGEXP_CONTAINS(full_name, r'(?i)smith');
Step-by-step explanation.
-
Postgres
ILIKEis a single token; the planner translates it to the internal~~*operator. No collation argument needed. -
MySQL
LIKEpicks up case-insensitivity fromutf8mb4_0900_ai_ci. If the column is bound to a_cscollation, the same query becomes case-sensitive — you'd needCOLLATE utf8mb4_general_ciper-query. -
SQL Server infers case from the column collation. If you can't be sure, the
COLLATEclause forces it. Be aware: explicitCOLLATEinWHEREmay disable index seeks. -
Snowflake
ILIKEis the same shape as Postgres. Snowflake also offersILIKE_ANY('%smith%', '%jones%')for multi-pattern OR — convenient when the pattern list grows. -
BigQuery has no
ILIKE. The cleanest substitute isREGEXP_CONTAINS(col, r'(?i)pattern')because ther'(?i)...'raw string with inline flag is idiomatic. A portable fallback isLOWER(col) LIKE LOWER('%smith%').
Output.
| Dialect | Query | Matches |
|---|---|---|
| Postgres | ILIKE '%smith%' |
1, 2, 4 |
| MySQL |
LIKE '%smith%' (default CI) |
1, 2, 4 |
| SQL Server | LIKE … COLLATE … CI_AS |
1, 2, 4 |
| Snowflake | ILIKE '%smith%' |
1, 2, 4 |
| BigQuery | REGEXP_CONTAINS(col, r'(?i)smith') |
1, 2, 4 |
Rule of thumb. Use the dialect's native shape whenever possible (ILIKE on PG/SF, LIKE on MySQL, COLLATE on SQL Server, REGEXP_CONTAINS on BigQuery). The portable LOWER(col) LIKE LOWER(pattern) workaround works everywhere but pays the cost of a per-row function call and disables ordinary B-tree indexes.
Worked example — LIKE_ANY and ILIKE_ANY for multi-pattern OR on Snowflake
Detailed explanation. A common refactor target is a WHERE col ILIKE '%foo%' OR col ILIKE '%bar%' OR col ILIKE '%baz%' OR … chain that grows alongside a hard-coded list of brand names, error codes, or category keywords. Snowflake ships LIKE_ANY / ILIKE_ANY to compress the chain into one operator with a comma-separated pattern list.
Question. A support_tickets table needs to flag every row whose subject mentions any of refund, chargeback, dispute, or cancellation — case-insensitive. Show the naive chained OR ILIKE form, the Snowflake ILIKE_ANY form, the Postgres ANY (ARRAY[…]) form, and the regex alternation form. Compare them on readability and performance.
Input (sample rows).
| id | subject |
|---|---|
| 1 | Need refund for order #4711 |
| 2 | Chargeback notice received |
| 3 | Unable to log in |
| 4 | DISPUTE: missing item |
| 5 | Order cancellation request |
| 6 | When will my order ship? |
Code.
-- 1. Naive chained OR — works everywhere; verbose
SELECT id, subject FROM support_tickets
WHERE subject ILIKE '%refund%'
OR subject ILIKE '%chargeback%'
OR subject ILIKE '%dispute%'
OR subject ILIKE '%cancellation%';
-- 2. Snowflake — ILIKE_ANY
SELECT id, subject FROM support_tickets
WHERE subject ILIKE ANY ('%refund%', '%chargeback%', '%dispute%', '%cancellation%');
-- 3. Postgres — ANY (ARRAY[...]) with ILIKE
SELECT id, subject FROM support_tickets
WHERE subject ILIKE ANY (ARRAY['%refund%', '%chargeback%', '%dispute%', '%cancellation%']);
-- 4. Regex alternation — portable on Postgres / MySQL / Snowflake / BigQuery
SELECT id, subject FROM support_tickets
WHERE subject ~* 'refund|chargeback|dispute|cancellation'; -- Postgres
-- WHERE REGEXP_CONTAINS(subject, r'(?i)refund|chargeback|dispute|cancellation'); -- BigQuery
Step-by-step explanation.
-
Naive OR form — explicit, readable, and supported on every dialect. The planner evaluates each branch separately; if
pg_trgmGIN is present onsubject, each branch is a bitmap index scan and the planner OR-merges the bitmaps. -
Snowflake
ILIKE ANY— one operator, one pattern list, one scan. Internally optimised — the engine compiles all patterns into a single matcher. -
Postgres
ANY (ARRAY[…])— the SQL-standard shape using array constants. Equivalent in semantics to the naive OR form; whether the planner is smart enough to use a single trigram scan instead of OR-merging is version-dependent (PG 14+ usually does). -
Regex alternation —
'refund|chargeback|dispute|cancellation'is one regex;~*is the case-insensitive match operator. Compiles to a single state machine — efficient at runtime. -
Performance ordering — on a 50 M row Postgres table with
pg_trgm, all four forms are within 10 % of each other. On a Snowflake table with search optimisation,ILIKE ANYis typically fastest because it bundles the pattern list. On a dialect without text indexes, all four are full scans of equal cost.
Output.
| Form | Matches | Readability |
|---|---|---|
Naive chained OR ILIKE
|
1, 2, 4, 5 | medium |
Snowflake ILIKE ANY
|
1, 2, 4, 5 | high |
Postgres ANY (ARRAY[…])
|
1, 2, 4, 5 | high |
| Regex alternation | 1, 2, 4, 5 | medium |
Rule of thumb. For more than three patterns, prefer ILIKE ANY (Snowflake) / ANY (ARRAY[…]) (Postgres) / regex alternation (everywhere else). The naive OR chain quickly becomes unreadable; the consolidated forms are easier to refactor when the pattern list comes from a config file or a IN-style sub-query.
SQL interview question on case-insensitive search at scale
A typical scaling-focused probe: "Our customers.full_name column has 50 million rows. The query LOWER(full_name) LIKE LOWER('%smith%') runs in 45 seconds. How do you make it sub-second without changing the SQL contract for callers?" The interviewer wants you to combine an expression index with an ILIKE rewrite (or a trigram index — covered in §5).
Solution Using a functional index on LOWER(full_name) plus an anchored rewrite
-- 1. One-time: create a functional index on the lowercased column
CREATE INDEX idx_customers_name_lower
ON customers (LOWER(full_name) text_pattern_ops);
-- 2. Rewrite the contains query as an anchored pair where possible.
-- If the contract requires "contains," install pg_trgm + GIN instead (§5).
-- For "starts-with" or "ends-with" you can use the functional index directly:
SELECT id, full_name FROM customers
WHERE LOWER(full_name) LIKE LOWER('smith%'); -- starts-with → index range scan
-- 3. For genuine "contains" semantics, the functional index alone is not enough.
-- The right answer is pg_trgm + GIN (§5):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_customers_name_trgm
ON customers USING GIN (full_name gin_trgm_ops);
SELECT id, full_name FROM customers
WHERE full_name ILIKE '%smith%'; -- now uses the GIN index
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | CREATE INDEX … (LOWER(full_name) text_pattern_ops) |
Postgres maintains an expression index keyed on the lowercased name |
| 2 | Rewrite '…smith…' → 'smith%' where the contract allows |
starts-with patterns can range-scan the index |
| 3 | If contract requires '%smith%' (genuine contains) → install pg_trgm GIN |
ILIKE '%smith%' now uses the GIN index via trigram lookup |
| 4 | Planner picks the lighter index | wall-time drops 1000× |
Output:
| Variant | Plan | Rows read | Wall time (p50) |
|---|---|---|---|
LOWER(full_name) LIKE LOWER('%smith%') (no index) |
seq scan | 50,000,000 | 45 s |
LOWER(full_name) LIKE 'smith%' + functional B-tree |
index range scan | 1,800 | 8 ms |
full_name ILIKE '%smith%' + pg_trgm GIN |
bitmap index scan | 1,800 | 14 ms |
Why this works — concept by concept:
-
Functional index on
LOWER(col)— turns theLOWER(col) LIKE 'smith%'shape into an index range scan because the index key is precomputed. -
text_pattern_ops opclass — required for
LIKEindex usage ontextcolumns when the database default collation is notC. Without it, the planner can't prove the range scan is safe. -
Trigram index for contains —
pg_trgmbreaks%smith%into 5 trigrams (smi,mit,ith, ...) and looks each one up in a GIN inverted index. Covered in depth in §5. - SQL contract preservation — adding indexes does not change the query result, only the plan. Callers see the same rows, faster.
- Cost — disk = O(N) extra per index; build = O(N log N); query drops from O(N) full scan to O(log N + matches) for the prefix case, O(matches) for the trigram case.
SQL
Topic — string functions
Case-insensitive search + string function problems
4. REGEXP, SIMILAR TO, and regex functions across dialects
sql regexp unlocks character classes, quantifiers, alternation, and capture groups that LIKE cannot express
The mental model in one line: regex in SQL is the same PCRE-flavoured language you know from Python and JavaScript, surfaced through dialect-specific operators (~, ~*) or functions (REGEXP_LIKE, REGEXP_CONTAINS, REGEXP_MATCH, REGEXP_EXTRACT, REGEXP_REPLACE, REGEXP_SUBSTR) — and the price you pay is no index, always a full scan, unless you pair the regex with a separate text-search index. Reach for regex when LIKE can't express the constraint, not before.
The dialect surface.
-
Postgres — operators
~(match),~*(case-insensitive match),!~and!~*(negated). Functionsregexp_match,regexp_matches,regexp_replace,regexp_split_to_array,regexp_split_to_table. Backed by the PCRE-compatible engine baked into the server. -
MySQL —
REGEXPandRLIKE(synonyms) for match;REGEXP_LIKE,REGEXP_SUBSTR,REGEXP_REPLACE,REGEXP_INSTRfrom MySQL 8.0+. Older versions only supportedREGEXP(POSIX BRE). -
SQL Server — no native regex. Workarounds: complex
LIKEpatterns with character classes (LIKE '[A-Z]%'), CLR functions written in C#, orSTRING_SPLITplus row-level logic. -
Snowflake —
REGEXP,RLIKE,REGEXP_LIKE,REGEXP_SUBSTR,REGEXP_REPLACE,REGEXP_INSTR,REGEXP_COUNT. PCRE-flavoured engine; supports(?i)inline flags. -
BigQuery —
REGEXP_CONTAINS,REGEXP_EXTRACT,REGEXP_EXTRACT_ALL,REGEXP_REPLACE,REGEXP_INSTR. RE2 engine — no back-references, no lookahead. Performance is linear; expressiveness is slightly reduced.
The seven regex primitives every SQL interview probes.
-
Anchors —
^start,$end. Critical for "the whole string matches X." -
Character classes —
[A-Z],[^@](not at-sign),\d,\w,\s. -
Quantifiers —
?zero-or-one,*zero-or-more,+one-or-more,{n,m}between n and m. -
Alternation —
(cat|dog)matches either. -
Capture groups —
( … )captures a slice for extraction. -
Back-references —
\1,\2inside the pattern to match captured text again (not supported on BigQuery's RE2). -
Inline flags —
(?i)case-insensitive,(?s)dot-matches-newline, etc.
SIMILAR TO — the SQL-standard half-breed.
- Postgres also supports
SIMILAR TO, defined by the SQL standard. Syntax is somewhere between LIKE and regex: it uses%and_as wildcards (like LIKE) but also accepts|,*,+,?,( … )(like regex). - In practice almost no one uses
SIMILAR TO. Either you need regex (use~) or you don't (useLIKE). Worth recognising on a code review; not worth writing.
Performance shape — regex is always a full scan unless paired with an external index.
- A B-tree on the column cannot accelerate a regex predicate (except for trivial cases the planner can rewrite into anchored LIKE).
- Postgres'
pg_trgmextension can index regex predicates that contain a substantial literal substring (pg_trgm'stext_opsGIN index supports~,~*,LIKE,ILIKE). - MySQL FULLTEXT and BigQuery
SEARCHare not regex-aware — they tokenise and index whole words.
Worked example — extract the domain from an email column
Detailed explanation. A daily reporting query needs the email domain in a new column for an aggregation by provider — gmail.com, yahoo.com, acme.io. The choice is between five layers: SUBSTR + INSTR, SPLIT_PART, REGEXP_EXTRACT, REGEXP_MATCH, or a chain of RIGHT(POSITION()) calls. Regex is the cleanest single-call answer.
Question. Given a users.email column, write a query that returns email, username (before @), and domain (after @) — using regex capture groups on Postgres, MySQL, Snowflake, and BigQuery. Then convert the dialect-specific functions into a single portable shape.
Input (sample rows).
| id | |
|---|---|
| 1 | alice@example.com |
| 2 | bob.jones@acme.io |
| 3 | carol+promo@gmail.com |
| 4 | dan@subdomain.example.co.uk |
| 5 | not-an-email |
Code.
-- Postgres
SELECT
email,
(regexp_match(email, '^([^@]+)@(.+)$'))[1] AS username,
(regexp_match(email, '^([^@]+)@(.+)$'))[2] AS domain
FROM users;
-- MySQL 8+
SELECT
email,
REGEXP_SUBSTR(email, '^[^@]+') AS username,
REGEXP_SUBSTR(email, '(?<=@).+$') AS domain
FROM users;
-- Snowflake
SELECT
email,
REGEXP_SUBSTR(email, '^[^@]+', 1, 1) AS username,
REGEXP_SUBSTR(email, '@(.+)$', 1, 1, 'e', 1) AS domain
FROM users;
-- BigQuery
SELECT
email,
REGEXP_EXTRACT(email, r'^([^@]+)@') AS username,
REGEXP_EXTRACT(email, r'@(.+)$') AS domain
FROM users;
Step-by-step explanation.
-
Pattern
^([^@]+)@(.+)$— anchored at both ends. Group 1 is "one or more non-@characters" (the username). Then a literal@. Group 2 is "one or more of any character" (the domain). -
Row 1 (
alice@example.com) — Group 1 =alice, Group 2 =example.com. -
Row 2 (
bob.jones@acme.io) — Group 1 =bob.jones(dots are allowed in[^@]), Group 2 =acme.io. -
Row 3 (
carol+promo@gmail.com) — Group 1 =carol+promo(+is a literal inside[^@]), Group 2 =gmail.com. -
Row 4 (
dan@subdomain.example.co.uk) — Group 1 =dan, Group 2 =subdomain.example.co.uk— multi-level domain captured cleanly. -
Row 5 (
not-an-email) — no@→ no match →NULLin both username and domain columns. -
Why the four dialects look different — Postgres'
regexp_matchreturns a text array; MySQL'sREGEXP_SUBSTRreturns a string and uses lookbehind for the domain; Snowflake'sREGEXP_SUBSTRhas a 6-argument form where the 6th argument is the capture-group index; BigQuery'sREGEXP_EXTRACTreturns the first capture group directly.
Output.
| username | domain | |
|---|---|---|
| alice@example.com | alice | example.com |
| bob.jones@acme.io | bob.jones | acme.io |
| carol+promo@gmail.com | carol+promo | gmail.com |
| dan@subdomain.example.co.uk | dan | subdomain.example.co.uk |
| not-an-email | NULL | NULL |
Rule of thumb. Extraction queries (domain from email, slug from URL, prefix from ticker symbol) are where regex pays for itself. Pure validation queries (is this a valid email shape?) are border cases — email LIKE '%@%.%' is 90 % as good as a strict regex and is supported on every dialect, including SQL Server.
Worked example — back-references for finding duplicate words in a comment column
Detailed explanation. Back-references — \1, \2 inside the same regex — let you match repeated content. They are not supported in BigQuery's RE2 engine, but Postgres, MySQL, and Snowflake all have them. A common cleaning task is "find every row whose comment column contains the same word twice in a row" ("the the", "on on", "is is") — a classic copy-paste typo that's easy to spot with back-references and very awkward to spot without.
Question. Write a Postgres query that returns every row whose comment contains a doubled word (case-insensitive), and the duplicated word itself. Then explain why BigQuery cannot run the same query and what the workaround looks like.
Input (sample rows).
| id | comment |
|---|---|
| 1 | The the user reported a bug. |
| 2 | Please please double check. |
| 3 | The user reported a bug. |
| 4 | I I think we need to investigate. |
| 5 | Working as expected. |
Code.
-- Postgres — \1 back-reference + (?i) inline case flag
SELECT
id,
comment,
(regexp_match(comment, '(?i)\m(\w+)\s+\1\m'))[1] AS doubled_word
FROM notes
WHERE comment ~* '\m(\w+)\s+\1\m';
-- Snowflake — same pattern, double-escape backslashes for the SQL string
SELECT
id,
comment,
REGEXP_SUBSTR(comment, '(?i)\\b(\\w+)\\s+\\1\\b', 1, 1, 'e', 1) AS doubled_word
FROM notes
WHERE REGEXP_LIKE(comment, '(?i).*\\b(\\w+)\\s+\\1\\b.*');
-- BigQuery — RE2 engine does NOT support \1 back-references.
-- Workaround: enumerate doubled-word candidates in code, or use SPLIT + zip-pair logic.
WITH tokens AS (
SELECT id, comment, w, pos
FROM notes,
UNNEST(REGEXP_EXTRACT_ALL(LOWER(comment), r'\w+')) AS w WITH OFFSET pos
)
SELECT a.id, n.comment, a.w AS doubled_word
FROM tokens a
JOIN tokens b ON a.id = b.id AND b.pos = a.pos + 1 AND a.w = b.w
JOIN notes n ON n.id = a.id;
Step-by-step explanation.
-
Pattern
\m(\w+)\s+\1\m—\mis a word-start anchor (Postgres);(\w+)captures one or more word characters;\s+matches whitespace;\1matches the same text the first group captured;\mends at a word boundary. -
(?i)inline flag — makes the match case-insensitive."The the"and"THE the"both match. -
Row 1 (
"The the user…") —(\w+)captures"The",\s+matches the space,\1then needs"The"again (case-insensitive) — finds"the". Match. -
Row 4 (
"I I think…") —(\w+)captures"I",\s+matches the space,\1matches the next"I". Match. -
BigQuery cannot run this because RE2 explicitly forbids back-references (they make worst-case matching exponential — RE2 is built for linear-time guarantees). The workaround tokenises the string, joins each token to its neighbour at offset
+1, and filters where the tokens are equal. - Engine difference matters in interviews — saying "RE2 doesn't support back-references" is the senior signal. Don't promise back-references on BigQuery; promise a tokenise-and-join pattern instead.
Output.
| id | comment | doubled_word |
|---|---|---|
| 1 | The the user reported a bug. | the |
| 2 | Please please double check. | please |
| 4 | I I think we need to investigate. | i |
Rule of thumb. Reach for back-references whenever the match shape depends on "the same thing appearing twice" — duplicate words, repeated digits, repeated tags. If you're targeting BigQuery, tokenise into a table and join; on Postgres / MySQL / Snowflake the regex form is one line.
SQL interview question on validating PII patterns
A senior probe sounds like: "Compliance flagged that one of our analytics tables may contain SSNs in a free-text comment column. Write a query that lists every row whose comment contains a string that looks like a US SSN (XXX-XX-XXXX, where X is a digit), and the matched substring." The interviewer is checking three things: character class ([0-9] vs \d), quantifier ({N}), and capture / extract semantics.
Solution Using REGEXP_LIKE + REGEXP_SUBSTR (Snowflake) and the equivalent Postgres pattern
-- Snowflake
SELECT
id,
comment,
REGEXP_SUBSTR(comment, '\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b') AS ssn_match
FROM customer_notes
WHERE REGEXP_LIKE(comment, '.*\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b.*');
-- Postgres (operator + function form)
SELECT
id,
comment,
(regexp_match(comment, '\m[0-9]{3}-[0-9]{2}-[0-9]{4}\M'))[1] AS ssn_match
FROM customer_notes
WHERE comment ~ '\m[0-9]{3}-[0-9]{2}-[0-9]{4}\M';
Step-by-step trace.
| comment | regex find | match? | ssn_match |
|---|---|---|---|
| Customer wants refund on order 123-45-6789 | 123-45-6789 |
yes | 123-45-6789 |
| Tracking number 999-12345-67 | shape doesn't fit \d{3}-\d{2}-\d{4}
|
no | NULL |
| Address is 100-12-3456 Main St | 100-12-3456 |
yes | 100-12-3456 |
| Phone (555) 123-4567 | no SSN-shaped match | no | NULL |
| ID 12-345-6789 | hyphens but wrong digit counts | no | NULL |
Output:
| id | comment (snippet) | ssn_match |
|---|---|---|
| 4711 | Customer wants refund on order 123-45-6789 | 123-45-6789 |
| 4719 | Address is 100-12-3456 Main St | 100-12-3456 |
| 4733 | Old record: 555-44-3322 — please redact | 555-44-3322 |
| 4760 | Backup file 999-88-7766 attached | 999-88-7766 |
Why this works — concept by concept:
-
Character class + quantifier —
[0-9]{3}-[0-9]{2}-[0-9]{4}is impossible to express in plainLIKE. Regex is the right layer. -
Word boundaries (
\b/\m/\M) — protect against false positives likeXX-123-45-6789-XXwhere the digits are part of a longer code. Snowflake uses\b; Postgres uses\m(start) and\M(end). -
REGEXP_LIKEfor the filter,REGEXP_SUBSTRfor the extract — paired like this you can both flag the row and return the offending substring for the compliance report. -
Backslash escaping in SQL strings —
'\\b'in Snowflake doubles the backslash because SQL string literals consume one level of escaping. Postgres'~operator andregexp_matchaccept single-backslash escapes in single-quoted patterns ('\m'). - Cost — full scan of the comment column; for a 50 M row table this is typically a 30–60 s scheduled job, not a sub-second user query. For the latter, route compliance scans to a dedicated index or an external text engine.
SQL
Topic — pattern matching
REGEXP + capture group problems (SQL)
5. Performance — trigram indexes, GIN, and leading wildcards
sql pattern matching becomes a system-design conversation the moment "make LIKE '%abc%' fast" is asked
The mental model in one line: a B-tree index on a string column accelerates LIKE 'prefix%' (range scan), accelerates nothing for LIKE '%suffix' or LIKE '%substring%', and that gap is what trigram indexes, FULLTEXT, search-optimisation services, and external search engines exist to close. Senior interviews on sql pattern matching spend most of their time here.
Why a B-tree fails on leading wildcards.
- A B-tree index is a sorted structure keyed by the column's prefix. The planner can find every row where
col >= 'abc'andcol < 'abd'inO(log N + matches)— that's the index range scan that powersLIKE 'abc%'. - For
LIKE '%abc%', there is no usable prefix — the matching rows are scattered everywhere in the sort order. The planner falls back to a sequential scan of every row.
Postgres — pg_trgm + GIN is the senior answer.
-
pg_trgmis a Postgres extension that breaks each string into trigrams — overlapping three-character windows.'cathedral'becomes' c',' ca','cat','ath','the','hed','edr','dra','ral','al ','l '. - A GIN (Generalized Inverted iNdex) index stores every trigram once, mapped to the list of row ids that contain it.
- For
LIKE '%edr%', the planner extracts the trigram'edr'from the pattern, looks it up in the GIN index, and returns the candidate row ids — typically a tiny fraction of the table. - For
ILIKE '%abc%', the same shape works because the trigrams are stored case-insensitively when the index usesgin_trgm_ops(Postgres normalises to lowercase internally). -
Trigger threshold — patterns shorter than 3 characters cannot use the trigram index because there are no full trigrams to look up.
LIKE '%ab%'either falls back to a sequential scan or matches every row that contains the trigrams'_ab'and'ab_'recombined.
MySQL — FULLTEXT for whole-word search.
-
CREATE FULLTEXT INDEX idx_notes_body ON notes(body)builds an inverted index of whole words, not trigrams. - Search shape:
WHERE MATCH(body) AGAINST('timeout' IN NATURAL LANGUAGE MODE)— different fromLIKE, returns rows ranked by relevance. -
IN BOOLEAN MODEsupports+required,-excluded,*wildcardsemantics. - Does not accelerate
LIKE '%abc%'directly. TheLIKEquery stays sequential; you have to rewrite the application to useMATCH AGAINST.
Snowflake — search optimisation service.
-
ALTER TABLE notes ADD SEARCH OPTIMIZATION ON EQUALITY(body), SUBSTRING(body);enables the service. - Internally Snowflake builds a hidden index that accelerates
=,LIKE,ILIKE,RLIKE,REGEXP_LIKE, andSEARCH()predicates. - Cost model: storage + compute uplift; only worth it on tables with 10⁸+ rows and frequent point / contains queries.
BigQuery — the SEARCH function and search indexes.
-
CREATE SEARCH INDEX idx_notes_body ON notes(body)builds a token-based index. - Query:
WHERE SEARCH(body, 'timeout')— tokenises like FULLTEXT. - Does not directly accelerate
LIKE '%abc%'; you rewrite toSEARCH()to get the index path.
External search — Elasticsearch / OpenSearch.
- For petabyte-scale text search, dedicated faceting, geo / fuzzy / ranking — none of which the warehouse models well — push the column to ES via a CDC pipeline (Debezium → Kafka → ES sink).
- The warehouse keeps the system of record; ES serves search queries with millisecond p95.
The decision tree.
-
Pattern is anchored (
'abc%') → B-tree index on the column. Free. -
Pattern is unanchored, table is small (< 1 M rows) → just run the sequential
LIKE '%abc%'. Operationally fine. -
Pattern is unanchored, table is medium (1 M – 100 M rows), Postgres →
pg_trgm+ GIN. The mainline answer. - Pattern is unanchored, table is medium, MySQL / Snowflake / BigQuery → FULLTEXT / search optimisation / SEARCH index respectively, with a SQL rewrite to use the right syntax.
- Pattern is unanchored, table is large (100 M+), or queries need ranking / faceting → push to Elasticsearch / OpenSearch. The warehouse is not the right tool.
Worked example — build a trigram-backed LIKE '%abc%' on Postgres
Detailed explanation. A notes.body column has 30 million rows. The query WHERE body ILIKE '%timeout%' runs in 42 seconds today (full scan). After installing pg_trgm and creating a GIN index, the same query runs in 18 milliseconds. The mechanic worth understanding is the trigram extraction → GIN lookup → bitmap scan pipeline.
Question. Show the full set-up: create the extension, build the GIN index, run the slow query before and the fast query after, then explain the planner's choices with EXPLAIN ANALYZE.
Input (table shape).
| column | type | rows |
|---|---|---|
| id | bigint | 30,000,000 |
| created_at | timestamptz | — |
| body | text | avg 280 chars |
Code.
-- 1. One-time setup
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_notes_body_trgm
ON notes USING GIN (body gin_trgm_ops);
-- 2. The query (unchanged shape)
SELECT id, body
FROM notes
WHERE body ILIKE '%timeout%';
-- 3. Inspect the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, body FROM notes WHERE body ILIKE '%timeout%';
Step-by-step explanation.
-
CREATE EXTENSION pg_trgm— loads the trigram operators (%,<%,%>) and the GIN opclasses (gin_trgm_ops). -
CREATE INDEX … USING GIN (body gin_trgm_ops)— builds an inverted index. Internally, for each row Postgres extracts every trigram of thebodycolumn and adds(trigram → row_id)pairs to the index. Build time on 30 M rows is typically 20–40 minutes; disk size ≈ 2× the table. -
body ILIKE '%timeout%'— the planner now considers the GIN index. It extracts the trigrams from the pattern ('tim','ime','meo','eou','out') and intersects the row-id lists for each. - Bitmap index scan + recheck — the planner emits a bitmap of candidate row ids, then rechecks the predicate against the actual row (because some trigram matches may be false positives at the pattern level). Typical candidate count is 0.01 % of the table.
-
Result — the same query plan that used to be a
Seq Scanis now aBitmap Heap Scanonidx_notes_body_trgmplus a recheck forbody ~~* '%timeout%'.
Output (before / after).
| Phase | Plan | Buffers | Wall time (p50) |
|---|---|---|---|
| Before | Seq Scan on notes | 580,221 shared hit | 42.0 s |
| After | Bitmap Index Scan + Recheck on idx_notes_body_trgm | 380 shared hit | 18 ms |
Rule of thumb. Trigram + GIN is the single biggest unlock for "contains-search" workloads on Postgres. The trade-off is index size (≈ 2× table) and slower writes (every insert touches the index). For columns that are append-mostly and read-heavy, the trade is overwhelmingly worth it.
Worked example — MySQL MATCH AGAINST boolean search on a comments column
Detailed explanation. MySQL's FULLTEXT index is the equivalent of Postgres' trigram GIN for whole-word text search, but the API is different: you don't LIKE '%word%' against a FULLTEXT-indexed column — you use MATCH(col) AGAINST('word'). Forgetting to rewrite the query is the most common reason a FULLTEXT index "doesn't help."
Question. A forum_posts.body MySQL column has a FULLTEXT index. The existing query WHERE body LIKE '%refactor%' is still slow. Rewrite it three ways: natural-language mode, boolean mode with +required and -excluded, and a phrase search. Then explain why each one uses the index and why the original LIKE did not.
Input (sample rows).
| id | body |
|---|---|
| 1 | We need to refactor the orders module. |
| 2 | The refactoring went well; tests pass. |
| 3 | I'm not a fan of refactor for refactor's sake. |
| 4 | Please review the PR before merging. |
| 5 | Refactoring duplicates, but keeping the auth flow. |
Code.
-- 0. The FULLTEXT index (already present in this scenario)
ALTER TABLE forum_posts ADD FULLTEXT INDEX idx_body (body);
-- 1. Natural-language mode (default) — relevance-ranked
SELECT id, body
FROM forum_posts
WHERE MATCH(body) AGAINST('refactor');
-- 2. Boolean mode — required `refactor`, excluded `auth`
SELECT id, body
FROM forum_posts
WHERE MATCH(body) AGAINST('+refactor -auth' IN BOOLEAN MODE);
-- 3. Phrase search — double-quoted exact phrase
SELECT id, body
FROM forum_posts
WHERE MATCH(body) AGAINST('"refactor the"' IN BOOLEAN MODE);
Step-by-step explanation.
-
The original
LIKE '%refactor%'doesn't use the FULLTEXT index at all — FULLTEXT is only consulted when the query usesMATCH(col) AGAINST(...). Rewriting is mandatory. -
Natural-language mode — stems
refactorinto a search token; rows 1, 2, 3 match (all containrefactororrefactoring). Row 5 (Refactoring) matches because MySQL's FULLTEXT stems word endings. -
Boolean mode with
+refactor -auth—+makesrefactormandatory;-authexcludes any row whose body containsauth. Row 5 is excluded because ofauth flow. -
Phrase search — double-quoted
"refactor the"requires that exact two-word sequence. Only row 1 matches; row 3 (refactor for refactor's sake) does not because the phrase shape is different. -
Why FULLTEXT but not LIKE — FULLTEXT tokenises the body once at insert time, building an inverted index of
(token → row_id).MATCH AGAINSTconsults that index.LIKEis a per-row substring matcher that ignores the index entirely.
Output.
| Query | Matches | Notes |
|---|---|---|
natural-language refactor
|
1, 2, 3, 5 | stems to refactor*
|
boolean +refactor -auth
|
1, 2, 3 | row 5 excluded by auth
|
phrase "refactor the"
|
1 | exact phrase |
Rule of thumb. Adding a FULLTEXT index without rewriting the queries to MATCH AGAINST gives you index storage cost with zero query benefit. Always pair the schema change with the query rewrite, and verify with EXPLAIN that the plan uses fulltext and not ALL (sequential scan).
Worked example — Postgres tsvector + tsquery for natural-language search
Detailed explanation. pg_trgm is great for substring-containment search; tsvector is great for natural-language search — stemming, stop-word removal, ranking. They serve different jobs. A common interview probe is: "Which one do you reach for, and how do you wire it up?"
Question. A blog_posts table needs natural-language search across title and body with relevance ranking. Show the schema change, the GIN index, and the search query with ranking.
Input (sample rows).
| id | title | body |
|---|---|---|
| 1 | Refactoring monoliths | We refactored the orders service this quarter. |
| 2 | The art of refactoring | Refactor your code before you scale your team. |
| 3 | Auth flow migration | Moving from sessions to JWT was a multi-week refactor. |
| 4 | Holiday party recap | The whole team turned up. |
Code.
-- Schema change — add a generated tsvector column and a GIN index
ALTER TABLE blog_posts
ADD COLUMN search_doc tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(body, '')), 'B')
) STORED;
CREATE INDEX idx_blog_posts_search_doc
ON blog_posts USING GIN (search_doc);
-- Query — natural language with ranking
SELECT
id,
title,
ts_rank(search_doc, to_tsquery('english', 'refactor & code')) AS score
FROM blog_posts
WHERE search_doc @@ to_tsquery('english', 'refactor & code')
ORDER BY score DESC;
Step-by-step explanation.
-
Generated
tsvectorcolumn — Postgres computes a normalised, stemmed, stop-word-stripped representation oftitle + bodyon every insert / update.setweighttags title tokens with weightA(highest) and body tokens with weightB. -
GIN index on
search_doc— turns the inverted-index lookups into millisecond bitmap scans. -
to_tsquery('english', 'refactor & code')— parses the user input into atsquerywith English stemming. The&is the boolean AND operator;|would be OR,!would be NOT. -
search_doc @@ tsquery— the match operator. Returns true if every required token is present. -
ts_rank(search_doc, tsquery)— assigns a relevance score combining token frequency and the weights from step 1. Row 2 ranks above row 1 because bothrefactorandcodeare present.
Output.
| id | title | score |
|---|---|---|
| 2 | The art of refactoring | 0.41 |
| 1 | Refactoring monoliths | 0.18 |
Rule of thumb. Reach for tsvector + tsquery when the user is writing a natural-language search box ("refactor code", "auth migration") and expects ranking. Reach for pg_trgm when the query is a programmatic LIKE '%substring%'. They're complementary, not competitive.
SQL interview question on choosing between trigram, FULLTEXT, and external search
A senior probe: "We have 100 million customer support tickets in a Postgres column. Analysts run LIKE '%refund%', LIKE '%timeout%', LIKE '%fraud%' style queries dozens of times per day, plus ad-hoc free-form searches. Walk me through how you'd architect the search layer." The right answer compares three options on five axes.
Solution Using a tiered architecture — trigram first, external ES second
-- Tier 1 — Postgres pg_trgm + GIN on the canonical column
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_tickets_body_trgm
ON tickets USING GIN (body gin_trgm_ops);
-- Tier 2 — When queries grow beyond contains-search (ranking, fuzzy, faceting),
-- stream tickets to Elasticsearch via Debezium → Kafka → ES sink.
-- Keep Postgres as the system of record; ES serves search-only.
Step-by-step trace.
| Stage | What it solves | What it doesn't |
|---|---|---|
pg_trgm GIN on body
|
LIKE '%refund%', ILIKE '%TimeOut%', regex with literal substrings |
Boolean operators, ranking by relevance, fuzzy / typo-tolerant search |
Add tsvector + tsquery + GIN |
natural-language word search inside Postgres (@@) |
Multi-language analyzers, complex aggregations on facets |
| Add Debezium → Kafka → ES | ranked search, facets, geo, multi-language analyzers, p95 milliseconds | New cluster to operate; eventual consistency window |
Output:
| Query shape | Latency target | Right tier |
|---|---|---|
LIKE 'PREFIX-%' |
< 10 ms | B-tree on column (no change) |
ILIKE '%refund%' |
< 50 ms | tier 1 — pg_trgm GIN |
ILIKE '%refnd%' typo-tolerant |
< 100 ms | tier 3 — Elasticsearch with fuzzy match |
Boolean: +refund -duplicate +urgent
|
< 200 ms | tier 2 or 3 — tsquery or ES |
Facet count by status, agent, category for "refund" |
< 500 ms | tier 3 — ES with aggregations |
Why this works — concept by concept:
-
Trigram index as the 80 % answer — for plain contains-search,
pg_trgmcovers most analyst queries with one extension and one index. No new system to operate. -
tsvectorfor natural-language search — when you need stemming (run,running,runsall match), stop-word handling, and ranking, Postgres' built-in full-text search is the middle tier without leaving the database. - External ES for the long tail — once you need fuzzy matching, multi-language analyzers, geo queries, or rich facets, Elasticsearch / OpenSearch becomes the right tool. CDC keeps Postgres as the source of truth.
- Operational reality — you don't deploy ES on day one for a feature that fits inside pg_trgm. You upgrade tiers when query shape demands it, not before.
- Cost — pg_trgm = +2× column storage, +20–40 % write amplification; tsvector = +30–80 % column storage; ES cluster = a separate infra footprint with its own ops budget.
SQL
Topic — indexing
Indexing problems (B-tree, GIN, trigram)
Pattern-matching recipes (cheat sheet)
-
Starts with.
col LIKE 'prefix%'— anchored prefix, index range scan, the fastest shape. -
Ends with.
col LIKE '%suffix'— full scan; fix with a reversed-string functional index for one-off suffix searches, or trigram for general unanchored. -
Contains (small dataset).
col LIKE '%substring%'— full scan is fine if the table is under ~1 M rows or the query is rare. -
Contains (big dataset). Postgres →
pg_trgm+ GIN. MySQL → FULLTEXT MATCH AGAINST. Snowflake → search optimisation service. BigQuery → search index +SEARCH(). -
Case-insensitive contains. Postgres / Snowflake →
col ILIKE '%abc%'. MySQL →LIKE(default CI). SQL Server →LIKE '%abc%' COLLATE …_CI_AS. BigQuery →REGEXP_CONTAINS(col, r'(?i)abc'). Portable →LOWER(col) LIKE LOWER('%abc%')(slow, kills indexes). -
Email validate (quick).
col LIKE '%@%.%'— 90 % of the strictness for 10 % of the work, supported everywhere. -
Email validate (strict).
col ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'— character class + quantifier; needs regex. -
Extract domain.
(regexp_match(email, '^[^@]+@(.+)$'))[1]on Postgres;REGEXP_EXTRACT(email, r'@(.+)$')on BigQuery. -
Exactly N characters.
col LIKE '___'(three of anything); orcol ~ '^.{3}$'if you want to be explicit. -
Escape a literal
%.col LIKE '50\%' ESCAPE '\'— match the four-character string50%. -
Multi-pattern OR. Postgres →
col ~ 'foo|bar|baz'(regex alternation) or chainedOR col LIKE. Snowflake →LIKE_ANY('%foo%', '%bar%'). -
Negation.
col NOT LIKE '%spam%'orcol !~ 'spam'— same scan cost as the positive shape. -
Anchored regex on indexed column. Postgres can use
pg_trgmto acceleratecol ~ 'abc'if'abc'is a literal substring of the pattern. Useful for partial regex queries.
Frequently asked questions
LIKE vs REGEXP — when should I use each?
Reach for LIKE whenever the pattern is expressible with just % and _ wildcards — starts-with, ends-with, contains, fixed-length-shaped strings. It's simpler, faster (when anchored, the B-tree index helps), and supported on every SQL dialect including SQL Server. Switch to REGEXP only when you need character classes ([A-Z]), quantifiers ({3,5}), alternation ((cat|dog)), capture groups, or back-references — things LIKE literally cannot express. A useful test: if you can write the pattern with only %, _, and literal characters, stay on LIKE.
Why is my LIKE '%abc%' query slow?
Because the leading % means the database cannot use a B-tree index on the column — it has to read every row and run the matcher against it. On a 100-row dev table this is invisible; on a 10-million-row prod table it's a 30–60 second sequential scan. The three production fixes are: (1) anchor the pattern to the start if possible (LIKE 'abc%' is a range scan, blazingly fast); (2) install a trigram index on Postgres (CREATE INDEX … USING GIN (col gin_trgm_ops)); (3) move text search to a dedicated index — FULLTEXT on MySQL, search optimisation on Snowflake, SEARCH index on BigQuery, or external Elasticsearch for petabyte-scale.
Is LIKE case-sensitive?
Depends on the dialect and the column collation. Postgres, Snowflake, and BigQuery treat LIKE as case-sensitive by default — use ILIKE (Postgres / Snowflake) or REGEXP_CONTAINS(col, r'(?i)…') (BigQuery) for case-insensitive matches. MySQL treats LIKE as case-insensitive by default because the standard collation utf8mb4_0900_ai_ci is case-insensitive — switch to a _bin or _cs collation if you need case-sensitivity. SQL Server defers to the column's collation entirely, and you can override per-query with COLLATE Latin1_General_CI_AS (case-insensitive) or Latin1_General_CS_AS (case-sensitive).
How do I escape % and _ in a LIKE query?
Use the ESCAPE clause to designate any single character as the escape: WHERE col LIKE '50\%' ESCAPE '\' matches the literal four-character string 50%. The escape character can be anything not appearing in your data — !, #, | are common choices when \ is already used (e.g. file paths). Without the ESCAPE clause, \% is just two literal characters \ and %-as-wildcard, which usually isn't what you want. This works the same on Postgres, MySQL, SQL Server, Snowflake, and BigQuery — it's part of the SQL standard.
Does REGEXP use an index?
In general no — a regex predicate forces a full sequential scan because the planner cannot prove the pattern matches a contiguous range in a sorted B-tree. The big exception is Postgres with pg_trgm + GIN: if the regex contains a substantial literal substring (≥ 3 characters), the planner extracts the trigrams from the pattern and uses the GIN index to narrow the candidate set, then rechecks each candidate against the full regex. MySQL's FULLTEXT and BigQuery's SEARCH are word-based, not regex-aware, so they don't accelerate REGEXP directly. The senior answer: pair regex with a trigram index on Postgres, or rewrite to LIKE 'literal-prefix%' AND col ~ 'rest-of-pattern' so the planner can use a B-tree for the anchored part.
Is LIKE '%abc%' with a leading wildcard ever production-safe?
Yes, in two scenarios. First, when the table is small — under roughly 1 million rows on a warehouse, under 100 thousand on a transactional database — the full scan finishes fast enough that nobody cares. Second, when the column has a trigram index (pg_trgm on Postgres, FULLTEXT on MySQL, search optimisation on Snowflake) the leading wildcard is no longer a problem because the index handles the unanchored case directly. The danger zone is "table grew without anyone noticing" — a query that ran in 200 ms at 100 K rows takes 60 seconds at 30 M, and the leading wildcard is almost always the culprit. Run EXPLAIN ANALYZE on every LIKE '%…%' query in your nightly job set, and add trigram indexes anywhere the scan dominates.
Practice on PipeCode
- Drill the pattern-matching practice library → for end-to-end LIKE, ILIKE, and REGEXP problems.
- Pair them with string function problems → to sharpen
SUBSTR,REPLACE, andSPLIT_PARTmuscle memory. - Pressure-test with filtering drills → — pattern matching shows up in nearly every non-trivial
WHEREclause. - Reinforce the index side with indexing problems → — B-tree, GIN, and trigram all live here.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Build the full SQL muscle group with SQL for data engineering interviews — from zero to FAANG →.
- For the design-round muscles, work through ETL system design for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every pattern-matching shape above ships with hands-on practice rooms where you write the SQL, run it against real data, and watch the planner choose the right index. Start with the pattern-matching library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.



![Visual diagram of REGEXP capture groups — input string 'alice@example.com' on the left, regex pattern '^([^@]+)@(.+)$' with two capture groups colour-coded, output showing group 1 'alice' and group 2 'example.com'; a 4-dialect function chip strip at the bottom; on a light PipeCode card.](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkmu16b3r6nky0j2chxx.jpeg)

Top comments (0)