Kristieene Knowles

Posted on Feb 26

When Regex Meets the DOM (And Suddenly It’s Not Simple Anymore)

#regex #javascript #webdev #frontend

I recently built a custom in-page “Ctrl + F”-style search and highlight feature.

The goal sounded simple:

Support multi-word queries
Prefer full phrase matches
Fall back to individual token matches
Highlight results in the DOM
Skip <code> and <pre> blocks

In my head?

“Easy. Just build a regex.”

Step 1: Build the Regex

If a user searches:

power shell

I generate a pattern like:

power[\s\u00A0]+shell|power|shell

The logic:

Try to match the full phrase first
If that fails, match individual tokens
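
Here is a minimal sketch of how such a pattern could be generated from a raw query. The helper names `escapeRegex` and `buildSearchPattern` are hypothetical, not the ones from my actual code, and I am assuming user input should be escaped so characters like `+` or `.` are treated literally. (In JavaScript `\s` already covers U+00A0, so the explicit `\u00A0` is redundant but harmless.)

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: turn "power shell" into "phrase|token1|token2".
function buildSearchPattern(query) {
  const tokens = query
    .trim()
    .split(/[\s\u00A0]+/)
    .filter(Boolean)
    .map(escapeRegex);

  if (tokens.length === 0) return null;      // nothing to search for
  if (tokens.length === 1) return tokens[0]; // a single word needs no alternation

  // Phrase first, so it wins whenever it can match at a given position.
  const phrase = tokens.join("[\\s\\u00A0]+");
  return [phrase, ...tokens].join("|");
}

console.log(buildSearchPattern("power shell"));
// power[\s\u00A0]+shell|power|shell
```

The returned string can then be passed to `new RegExp(pattern, "gi")`, assuming case-insensitive, global matching is what you want.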

On paper? Clean.

In isolation? Works.

Step 2: Enter the DOM

This is where things escalated.

Instead of just running string.match(), I had to:

Walk the DOM
Avoid header UI
Avoid <pre>, <code>, <script>, <style>
Avoid breaking syntax highlighting
Replace only text nodes
Preserve structure

That meant using a TreeWalker.

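// Visit only text nodes under root, skipping any inside protected elements.
const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
  acceptNode(node) {
    const p = node.parentElement;
    // A text node without a parent element cannot be checked, so skip it.
    if (!p) return NodeFilter.FILTER_REJECT;

    // closest() checks p and all its ancestors, so nested spans inside <pre> are skipped too.
    if (p.closest("code, pre, script, style")) {
      return NodeFilter.FILTER_REJECT;
    }

    return NodeFilter.FILTER_ACCEPT;
  },
});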

Now we’re not just doing regex.
We’re doing controlled DOM mutation.
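
To make "controlled" a bit more concrete, here is a rough sketch of one way to do it: collect the text nodes first, then replace each one with a fragment of plain text and `<mark>` elements. The names `splitOnMatches` and `highlightAll` are hypothetical, and this is an illustration under those assumptions rather than my exact implementation.

```javascript
// Split a string into plain and matched segments (pure, no DOM needed).
// The regex must have the "g" flag, otherwise matchAll throws a TypeError.
function splitOnMatches(text, regex) {
  const segments = [];
  let cursor = 0;
  // matchAll iterates over a copy of the regex, so the caller's lastIndex is not advanced.
  for (const m of text.matchAll(regex)) {
    if (m[0].length === 0) continue; // ignore empty matches
    if (m.index > cursor) {
      segments.push({ text: text.slice(cursor, m.index), hit: false });
    }
    segments.push({ text: m[0], hit: true });
    cursor = m.index + m[0].length;
  }
  if (cursor < text.length) {
    segments.push({ text: text.slice(cursor), hit: false });
  }
  return segments;
}

// Browser-only part: walk, collect, then mutate.
function highlightAll(root, regex) {
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode(node) {
      const p = node.parentElement;
      if (!p || p.closest("code, pre, script, style")) {
        return NodeFilter.FILTER_REJECT;
      }
      return NodeFilter.FILTER_ACCEPT;
    },
  });

  // Collect first: replacing nodes while the walker is still moving is fragile.
  const nodes = [];
  while (walker.nextNode()) nodes.push(walker.currentNode);

  for (const node of nodes) {
    const segments = splitOnMatches(node.nodeValue, regex);
    if (!segments.some((s) => s.hit)) continue; // leave untouched nodes alone

    const fragment = document.createDocumentFragment();
    for (const s of segments) {
      if (s.hit) {
        const mark = document.createElement("mark");
        mark.textContent = s.text;
        fragment.append(mark);
      } else {
        fragment.append(s.text); // strings are appended as text nodes
      }
    }
    node.replaceWith(fragment);
  }
}
```

Keeping the splitting logic free of DOM calls means it can be tested on plain strings, and only `highlightAll` needs a real page.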

Step 3: The Alternation Problem

This is where it got interesting.

Even though the phrase appears first in the alternation:

phrase|token1|token2

The engine still happily matches:

power
shell
PowerShell

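Depending on context.

Alternation order only decides which branch wins at a single starting position. The engine takes the leftmost match it can find and never looks ahead for a "better" phrase later in the text. And since “PowerShell” has no whitespace between the two words, the phrase branch fails there and the token branches pick up the pieces (assuming case-insensitive matching).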
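
A small demonstration of that behaviour on a plain string. The sample text is made up, and I am assuming the `g` and `i` flags:

```javascript
const pattern = /power[\s\u00A0]+shell|power|shell/gi;
const text = "PowerShell scripts run in a power shell";

// Collect every match in order of appearance.
const found = [...text.matchAll(pattern)].map((m) => m[0]);

console.log(found);
// ["Power", "Shell", "power shell"]
```

The phrase branch only wins at the end, where whitespace actually separates the two words. Inside "PowerShell" the token branches match the two halves separately.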

So now the problem isn’t “regex syntax”.

It’s:

Overlapping matches
Execution order
Resetting lastIndex
Avoiding double mutation
Preventing nested <mark> elements
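
Two of these are easy to show in a few lines. The sketch below assumes highlights are wrapped in `<mark>`. The names `PROTECTED`, `isProtected` and `hasMatch` are hypothetical.

```javascript
// Adding "mark" to the protected selector means already-highlighted text is
// skipped on a second run, which avoids nested <mark> elements.
const PROTECTED = "code, pre, script, style, mark";

// Browser-only: true if the element or any ancestor is protected.
function isProtected(element) {
  return element.closest(PROTECTED) !== null;
}

// The lastIndex trap: a regex with the "g" flag remembers where it stopped.
const globalRe = /shell/gi;
const first = globalRe.test("shell");  // true, lastIndex is now 5
const second = globalRe.test("shell"); // false, the search started at index 5

// One way around it: reset lastIndex before every test.
function hasMatch(regex, text) {
  regex.lastIndex = 0;
  return regex.test(text);
}

console.log(first, second, hasMatch(globalRe, "shell"));
// true false true
```

Another option is to avoid stateful calls entirely and use `matchAll`, which does not advance `lastIndex` on the regex you pass in.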

Step 4: Two Passes?

At one point I thought:

Maybe this shouldn’t be one regex.

Maybe the logic should be:

Try phrase match
If none found, then try token match

Which sounds simple…

Until you realise your DOM has already been mutated once.

Now you’re managing state across passes.
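
One way to sidestep that, sketched below, is to decide which regex to use before touching the DOM: read the text of the candidate nodes, check whether the phrase occurs anywhere, and only then run a single highlighting pass. `chooseRegex` is a hypothetical helper, and this sketch assumes a phrase never spans more than one text node, which is not always true on a real page.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Pick the phrase regex if the phrase occurs in any of the texts,
// otherwise fall back to a regex that matches individual tokens.
function chooseRegex(texts, query) {
  const tokens = query
    .trim()
    .split(/[\s\u00A0]+/)
    .filter(Boolean)
    .map(escapeRegex);
  if (tokens.length === 0) return null;

  const phraseSource = tokens.join("[\\s\\u00A0]+");
  if (tokens.length === 1) return new RegExp(phraseSource, "gi");

  // The probe has no "g" flag, so test() carries no lastIndex state.
  const probe = new RegExp(phraseSource, "i");
  if (texts.some((t) => probe.test(t))) {
    return new RegExp(phraseSource, "gi");
  }
  return new RegExp(tokens.join("|"), "gi");
}

console.log(chooseRegex(["open a power shell window"], "power shell").source);
// power[\s\u00A0]+shell
console.log(chooseRegex(["PowerShell basics"], "power shell").source);
// power|shell
```

Because the choice is made on plain strings, the DOM is mutated only once and there is no state to carry between passes.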

The Realisation

I understand JavaScript logic.

I understand regex.

But applying that logic safely across a live DOM tree?

That’s a different tier of problem.

Regex is deterministic.
The DOM is structural and stateful.

And once you start replacing text nodes, everything becomes delicate.

What I Learned

Regex problems are easy in isolation.
DOM mutation problems are easy in isolation.
Combining them multiplies complexity.

Also:

The line between “simple feature” and “mini search engine” is very thin.

Where I Am Now

The search works.

Mostly.

It highlights.
It skips protected blocks.
It respects structure.

But it’s not a browser-level Ctrl + F.
Not yet.

And that’s the interesting part.

I now respect the DOM far more than I did before.

And I never thought I’d say this sentence naturally:

I get the logic of JavaScript.
Making that logic behave predictably inside a living DOM tree is the real challenge.

There’s still refinement to do.
Edge cases to tame.
State to simplify.

But that’s the line between “feature complete” and “actually robust.”

And I’m somewhere in the middle of that line.
