Kristieene Knowles

When Regex Meets the DOM (And Suddenly It’s Not Simple Anymore)

I recently built a custom in-page “Ctrl + F”-style search and highlight feature.

The goal sounded simple:

  • Support multi-word queries
  • Prefer full phrase matches
  • Fall back to individual token matches
  • Highlight results in the DOM
  • Skip <code> and <pre> blocks

In my head?

“Easy. Just build a regex.”

Step 1: Build the Regex

If a user searches:

power shell

I generate a pattern like:

power[\s\u00A0]+shell|power|shell

The logic:

  • Try to match the full phrase first
  • If that fails, match individual tokens
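A minimal sketch of how such a pattern could be generated from the raw query. The helper names (escapeRegex, buildSearchRegex) and the "gi" flags are assumptions for illustration, not the code from the actual feature. User input is escaped first so that a query like "c++" does not turn into regex syntax.

```javascript
// Hypothetical helpers: names and flags are assumptions for illustration.
function escapeRegex(text) {
  // Escape characters that have special meaning inside a regular expression.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function buildSearchRegex(query) {
  const tokens = query.trim().split(/\s+/).filter(Boolean).map(escapeRegex);
  if (tokens.length === 0) return null;

  // Full phrase: tokens separated by one or more whitespace characters.
  // In JavaScript, \s already covers U+00A0, so listing it is redundant but harmless.
  const phrase = tokens.join("[\\s\\u00A0]+");

  // A single token needs no fallback alternatives.
  const alternatives = tokens.length > 1 ? [phrase, ...tokens] : tokens;
  return new RegExp(alternatives.join("|"), "gi");
}

const regex = buildSearchRegex("power shell");
console.log(regex.source); // power[\s\u00A0]+shell|power|shell
```

Returning null for an empty query lets the caller skip the DOM work entirely instead of running a regex that matches nothing useful.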

On paper? Clean.

In isolation? Works.


Step 2: Enter the DOM

This is where things escalated.

Instead of just running string.match(), I had to:

  • Walk the DOM
  • Avoid header UI
  • Avoid <pre>, <code>, <script>, <style>
  • Avoid breaking syntax highlighting
  • Replace only text nodes
  • Preserve structure

That meant using a TreeWalker.

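// `root` is assumed to be the container element the search runs in.
const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
  acceptNode(node) {
    // A text node without a parent element has nowhere to be highlighted.
    const p = node.parentElement;
    if (!p) return NodeFilter.FILTER_REJECT;

    // Skip text that sits anywhere inside a protected element.
    if (p.closest("code, pre, script, style")) {
      return NodeFilter.FILTER_REJECT;
    }

    return NodeFilter.FILTER_ACCEPT;
  },
});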

Now we’re not just doing regex.
We’re doing controlled DOM mutation.
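A rough sketch of what that mutation step might look like, under some assumptions: the function names are hypothetical, the regex is expected to carry the g flag (matchAll throws without it), and matches are wrapped in <mark> elements. The string splitting is kept separate from the DOM work so it can be reasoned about on its own.

```javascript
// Pure helper: split a string into plain and matched segments.
function splitByMatches(text, regex) {
  const segments = [];
  let cursor = 0;
  // matchAll needs the g flag and iterates over a clone of the regex,
  // so it does not advance the caller's lastIndex.
  for (const match of text.matchAll(regex)) {
    if (match[0].length === 0) continue; // ignore empty matches
    if (match.index > cursor) {
      segments.push({ text: text.slice(cursor, match.index), hit: false });
    }
    segments.push({ text: match[0], hit: true });
    cursor = match.index + match[0].length;
  }
  if (cursor < text.length) {
    segments.push({ text: text.slice(cursor), hit: false });
  }
  return segments;
}

// DOM step (browser only): swap one text node for plain text plus <mark> elements.
function highlightTextNode(node, regex) {
  const segments = splitByMatches(node.nodeValue, regex);
  if (!segments.some((segment) => segment.hit)) return;

  const fragment = document.createDocumentFragment();
  for (const segment of segments) {
    if (segment.hit) {
      const mark = document.createElement("mark");
      mark.textContent = segment.text;
      fragment.appendChild(mark);
    } else {
      fragment.appendChild(document.createTextNode(segment.text));
    }
  }
  node.replaceWith(fragment);
}

// Collect the nodes first, then mutate, so the walker never visits its own output.
function highlightAll(walker, regex) {
  const nodes = [];
  while (walker.nextNode()) nodes.push(walker.currentNode);
  nodes.forEach((node) => highlightTextNode(node, regex));
}
```

Collecting the text nodes before touching any of them is the important design choice here: replacing a node while the TreeWalker is still positioned on it makes the traversal hard to predict.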


Step 3: The Alternation Problem

This is where it got interesting.

Even though the phrase appears first in the alternation:

phrase|token1|token2

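The engine still happily matches:

  • power
  • shell
  • both halves of PowerShell

Alternation order is only decided at each starting position. The phrase wins where it actually occurs, and the bare tokens still match everywhere else.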
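To make that concrete, here is the pattern from Step 1 run against a made-up sample string. The engine tries the alternatives left to right at every position, so "PowerShell" (no whitespace) falls through to the single tokens, while "power shell" is taken as one phrase.

```javascript
const pattern = /power[\s\u00A0]+shell|power|shell/gi;
const sample = "PowerShell and power shell"; // sample text invented for illustration

// Collect only the matched substrings, in document order.
const found = [...sample.matchAll(pattern)].map((m) => m[0]);
console.log(found); // [ 'Power', 'Shell', 'power shell' ]
```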

So now the problem isn’t “regex syntax”.

It’s:

  • Overlapping matches
  • Execution order
  • Resetting lastIndex
  • Avoiding double mutation
  • Preventing nested <mark> elements
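Two of these are easy to reproduce in a few lines. The sketch below shows the lastIndex behaviour of a global regex, plus one possible guard against nested highlights: treating existing <mark> elements as protected, the same way code and pre are. The names hasMatch, isProtected and PROTECTED_SELECTOR are hypothetical.

```javascript
const tokenRegex = /power|shell/gi;

// A global regex remembers where its last successful match ended.
console.log(tokenRegex.test("power")); // true  (lastIndex is now 5)
console.log(tokenRegex.test("power")); // false (search resumed at index 5, then lastIndex reset to 0)

// One way to stay safe when reusing a regex across text nodes: reset before each use.
function hasMatch(regex, text) {
  regex.lastIndex = 0;
  return regex.test(text);
}
console.log(hasMatch(tokenRegex, "power")); // true
console.log(hasMatch(tokenRegex, "power")); // true

// Hypothetical guard against nested <mark> elements:
// text that already sits inside a highlight is skipped on later passes.
const PROTECTED_SELECTOR = "code, pre, script, style, mark";
function isProtected(textNode) {
  const parent = textNode.parentElement;
  return !parent || parent.closest(PROTECTED_SELECTOR) !== null;
}
```

Note that adding mark to the selector also skips any <mark> elements the page author wrote by hand, which may or may not be what you want.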

Step 4: Two Passes?

At one point I thought:

Maybe this shouldn’t be one regex.

Maybe the logic should be:

  1. Try phrase match
  2. If none found, then try token match

Which sounds simple…

Until you realise your DOM has already been mutated once.

Now you’re managing state across passes.
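One possible way around that, sketched under assumptions: make the first pass read-only. Scan the collected text for the phrase, decide which regex to use, and only then mutate the DOM once. The helper names are hypothetical, and `texts` is assumed to be an array of nodeValue strings taken from the accepted text nodes.

```javascript
// Hypothetical helpers: a sketch of "decide first, mutate once".
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function pickRegex(query, texts) {
  const tokens = query.trim().split(/\s+/).filter(Boolean).map(escapeRegex);
  if (tokens.length === 0) return null;

  const phraseSource = tokens.join("[\\s\\u00A0]+");

  // Non-global regex for the read-only check, so there is no lastIndex state to manage.
  const phraseProbe = new RegExp(phraseSource, "i");
  const phraseFound = texts.some((text) => phraseProbe.test(text));

  // Phrase found somewhere: highlight only the phrase. Otherwise fall back to tokens.
  const source = phraseFound ? phraseSource : tokens.join("|");
  return new RegExp(source, "gi");
}

console.log(pickRegex("power shell", ["Open power shell now"]).source); // power[\s\u00A0]+shell
console.log(pickRegex("power shell", ["PowerShell"]).source);           // power|shell
```

This check only sees one text node at a time, so a phrase split across elements (for example "power" followed by "shell" inside a bold tag) would still fall back to token matching.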


The Realisation

I understand JavaScript logic.

I understand regex.

But applying that logic safely across a live DOM tree?

That’s a different tier of problem.

Regex is deterministic.
The DOM is structural and stateful.

And once you start replacing text nodes, everything becomes delicate.


What I Learned

  • Regex problems are easy in isolation.
  • DOM mutation problems are easy in isolation.
  • Combining them multiplies complexity.

Also:

The line between “simple feature” and “mini search engine” is very thin.

Where I Am Now

The search works.

Mostly.

It highlights.
It skips protected blocks.
It respects structure.

But it’s not a browser-level Ctrl + F.
Not yet.

And that’s the interesting part.

I now respect the DOM far more than I did before.

And I never thought I’d say this sentence naturally:

I get the logic of JavaScript.
Making that logic behave predictably inside a living DOM tree is the real challenge.

There’s still refinement to do.
Edge cases to tame.
State to simplify.

But that’s the line between “feature complete” and “actually robust.”

And I’m somewhere in the middle of that line.
