SEN LLC

Posted on May 28

A Click-to-Build Regex Builder in 500 Lines of Vanilla JS — Token Model and Live Highlight Internals

#javascript #regex #webdev #devtools

Most developers can read regex better than they can write it. You see \d{2,4}-\d{1,4}-\d{4} and think "phone number," but when starting from scratch you reach for a reference. I built a 500-line vanilla JS visual builder where you click tokens to assemble the regex, with live match highlighting against test text. 21 token types, 26 unit tests, no dependencies, no build step.

🌐 Demo: https://sen.ltd/portfolio/regex-builder/
📦 GitHub: https://github.com/sen-ltd/regex-builder

The token model is everything

The whole design rests on one observation: a regex is a concatenation of tokens. Each token is either parameterless (it always emits the same string) or parameterised (it takes user input and emits something based on it):

// No parameter — emit the same string every time
{ id: "digit",        pattern: "\\d" }
{ id: "any",          pattern: "." }
{ id: "q-one-or-more", pattern: "+" }

// Parameterised — user input lives in `.value`, paramFn builds the string
{ id: "literal",     paramFn: (v) => escapeForRegex(v) }
{ id: "char-class",  paramFn: (v) => `[${v}]` }
{ id: "q-exact",     paramFn: (v) => `{${v}}` }

That's it. All 21 tokens (character classes, quantifiers, anchors, groups) fit this model. Compilation is just joining each token's output:

export function compile(tokens) {
  return tokens.map((t) => {
    const def = getTokenDef(t.id);
    if (!def) return "";
    if (def.paramFn) return def.paramFn(t.value);
    return def.pattern ?? "";
  }).join("");
}
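
getTokenDef and the catalog aren't shown in the snippet above. As a minimal sketch, assuming the catalog is a plain array of definitions looked up by id (the three entries and the Map index here are illustrative, not the project's actual code), the pieces could fit together like this:

```javascript
// Illustrative three-entry catalog; the real one holds 21 definitions.
const TOKEN_CATALOG = [
  { id: "digit", pattern: "\\d" },
  { id: "q-one-or-more", pattern: "+" },
  { id: "char-class", paramFn: (v) => `[${v}]` },
];

// Assumed lookup strategy: index the catalog by id once.
const DEFS_BY_ID = new Map(TOKEN_CATALOG.map((def) => [def.id, def]));

function getTokenDef(id) {
  return DEFS_BY_ID.get(id);
}

function compile(tokens) {
  return tokens.map((t) => {
    const def = getTokenDef(t.id);
    if (!def) return ""; // unknown ids contribute nothing
    if (def.paramFn) return def.paramFn(t.value);
    return def.pattern ?? "";
  }).join("");
}

const hexRun = compile([
  { id: "char-class", value: "0-9a-f" },
  { id: "q-one-or-more" },
]);
console.log(hexRun);                                  // [0-9a-f]+
console.log(new RegExp(`^${hexRun}$`).test("c0ffee")); // true
```

Note that an unknown id compiles to an empty string rather than throwing, so a stale token in the sequence degrades quietly instead of breaking the live preview.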

Test that an email-ish pattern composes correctly:

test("anchored email-ish pattern", () => {
  const tokens = [
    { id: "start" }, { id: "word-char" }, { id: "q-one-or-more" },
    { id: "literal", value: "@" }, { id: "word-char" }, { id: "q-one-or-more" },
    { id: "literal", value: "." }, { id: "word-char" }, { id: "q-one-or-more" },
    { id: "end" },
  ];
  assert.equal(compile(tokens), "^\\w+@\\w+\\.\\w+$");
});

The @ and . go through the literal token, which auto-escapes them. The user never needs to remember regex metacharacters.

Escaping literals — hit every metachar

const ESCAPE_RE = /[.*+?^${}()|[\]\\\/]/g;

export function escapeForRegex(s) {
  return String(s).replace(ESCAPE_RE, "\\$&");
}

The full set: . * + ? ^ $ { } ( ) | [ ] \ /. The / isn't strictly required for new RegExp() since we're not in literal syntax, but users will paste the result into /pattern/flags form, so we escape it preemptively.

$& is the substitution token for "the full match," so the replacement string "\\$&" emits a backslash followed by the matched metacharacter, rewriting one character in place. A backslash itself gets doubled (a\b → a\\b).

test("escapes regex metacharacters", () => {
  assert.equal(escapeForRegex("3.14"), "3\\.14");
  assert.equal(escapeForRegex("[a]"), "\\[a\\]");
});

test("non-meta chars pass through unchanged", () => {
  assert.equal(escapeForRegex("hello world"), "hello world");
  assert.equal(escapeForRegex("日本語"), "日本語");
});

Non-Latin characters (Japanese here) have no special meaning in regex, so they pass through untouched.
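
One way to sanity-check the escape set is a round trip: an escaped literal, anchored and compiled, should match its own source string and reject near misses. A small sketch (the sample strings are made up for illustration):

```javascript
const ESCAPE_RE = /[.*+?^${}()|[\]\\\/]/g;

function escapeForRegex(s) {
  return String(s).replace(ESCAPE_RE, "\\$&");
}

const raw = "a+b (c) $5.00";
const exact = new RegExp("^" + escapeForRegex(raw) + "$");

console.log(escapeForRegex(raw));         // a\+b \(c\) \$5\.00
console.log(exact.test(raw));             // true
console.log(exact.test("a+b (c) $5x00")); // false: the dot is literal now

// Without escaping, the same text is read as regex syntax instead.
const unescaped = new RegExp("^" + raw + "$");
console.log(unescaped.test(raw));         // false
```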

Quantifiers as positional tokens

\d+ means "digit, one or more." Naively you'd model it as a digit with a quantifier modifier attached. In the token list it's just [digit, q-one-or-more], two independent tokens. Compile concatenates them in order, and the regex engine applies + to whatever atom immediately precedes it, which is exactly what we want.

[digit, q-one-or-more]                 → \d+
[literal("#"), digit, q-one-or-more]   → #\d+
[group-open, literal("cat"), alternation, literal("dog"), group-close]
                                       → (cat|dog)

The token catalog has a modifiesPrevious: true flag on quantifiers, but the current compiler doesn't use it. We'll need it the day we add drag-and-drop to attach a + directly to a \w chip and move them as a pair. Until then, positional concatenation gets us free correctness.
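
One consequence of positional quantifiers is worth seeing concretely: the engine binds a quantifier to the single atom before it, so a + placed after a multi-character literal repeats only the last character. Wrapping the literal in group tokens changes the binding. A sketch using plain strings in place of compiled tokens:

```javascript
// Each array stands in for the output of compiled tokens, joined in order.
const bare = ["^", "ab", "+", "$"].join("");              // ^ab+$
const grouped = ["^", "(", "ab", ")", "+", "$"].join(""); // ^(ab)+$

console.log(new RegExp(bare).test("abbb"));    // true: + repeats only "b"
console.log(new RegExp(bare).test("abab"));    // false
console.log(new RegExp(grouped).test("abab")); // true: + repeats the group
```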

Live match highlighting — the segmentText pattern

To render highlights in <mark> spans, you have to split the text into alternating matched / unmatched segments first:

export function segmentText(text, matches) {
  const out = [];
  let cursor = 0;
  for (let i = 0; i < matches.length; i++) {
    const m = matches[i];
    if (m.start > cursor) {
      out.push({ text: text.slice(cursor, m.start), matched: false });
    }
    out.push({ text: text.slice(m.start, m.end), matched: true, matchIndex: i });
    cursor = m.end;
  }
  if (cursor < text.length) {
    out.push({ text: text.slice(cursor), matched: false });
  }
  return out;
}

The UI converts each segment to HTML:

$("highlighted").innerHTML = segs.map((s) =>
  s.matched
    ? `<mark>${escapeHtml(s.text)}</mark>`
    : escapeHtml(s.text)
).join("");
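
escapeHtml isn't shown in this post. A common minimal version, offered here as an assumption about what it does rather than the project's actual helper, replaces the five characters that matter in HTML text and attribute contexts. renderSegments is a hypothetical name for the mapping step, pulled out so it can run without a DOM:

```javascript
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function escapeHtml(s) {
  return String(s).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

// Same mapping as the UI snippet, but returning the string instead of
// assigning it to innerHTML.
function renderSegments(segs) {
  return segs
    .map((s) => (s.matched ? `<mark>${escapeHtml(s.text)}</mark>` : escapeHtml(s.text)))
    .join("");
}

const html = renderSegments([
  { text: "<b> costs ", matched: false },
  { text: "42", matched: true, matchIndex: 0 },
]);
console.log(html); // &lt;b&gt; costs <mark>42</mark>
```

Escaping each segment before interpolation matters because the test text is user input and ends up in innerHTML.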

Because segmentText is pure, all the edge cases are unit-testable without a DOM:

test("multiple matches with gaps", () => {
  const segs = segmentText("a 1 b 2 c", [
    { start: 2, end: 3, text: "1" },
    { start: 6, end: 7, text: "2" },
  ]);
  // Expect: 5 segments alternating ["a ", "1", " b ", "2", " c"]
  assert.equal(segs.length, 5);
  assert.equal(segs[1].matched, true);
  assert.equal(segs[3].matched, true);
});

It's text.slice at the right boundaries — easy to write, easy to break. Tests catch every off-by-one I make.
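
Beyond example-based tests, segmentText has an invariant that is cheap to assert: concatenating every segment's text reproduces the input. The sketch below feeds it matches from a real regex; findMatches is a hypothetical helper for this example, not part of the project:

```javascript
function segmentText(text, matches) {
  const out = [];
  let cursor = 0;
  for (let i = 0; i < matches.length; i++) {
    const m = matches[i];
    if (m.start > cursor) {
      out.push({ text: text.slice(cursor, m.start), matched: false });
    }
    out.push({ text: text.slice(m.start, m.end), matched: true, matchIndex: i });
    cursor = m.end;
  }
  if (cursor < text.length) {
    out.push({ text: text.slice(cursor), matched: false });
  }
  return out;
}

// Hypothetical helper: turn matchAll results into { start, end } pairs.
function findMatches(text, regex) {
  return [...text.matchAll(regex)].map((m) => ({
    start: m.index,
    end: m.index + m[0].length,
  }));
}

const text = "order #12, order #345";
const segs = segmentText(text, findMatches(text, /\d+/g));

console.log(segs.map((s) => s.text).join("") === text);               // true
console.log(segs.filter((s) => s.matched).map((s) => s.text).join()); // 12,345
console.log(segs.length);                                             // 4
```

The last match ends exactly at the end of the string, so there is no trailing unmatched segment.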

Matching — `matchAll` vs `match`

String.prototype.matchAll requires the g flag and throws a TypeError without it. A non-global regex has to go through .match() instead, which returns only the first match. Wrap the difference:

export function tryMatch(tokens, flags, text) {
  const pattern = compile(tokens);
  if (!pattern) return { ok: true, regex: null, matches: [] };

  let regex;
  try { regex = new RegExp(pattern, flags); }
  catch (e) { return { ok: false, error: e.message }; }

  const matches = [];
  if (regex.global) {
    for (const m of text.matchAll(regex)) {
      matches.push({ start: m.index, end: m.index + m[0].length, text: m[0], groups: m.slice(1) });
    }
  } else {
    const m = text.match(regex);
    if (m) matches.push({ start: m.index, end: m.index + m[0].length, text: m[0], groups: m.slice(1) });
  }
  return { ok: true, regex, matches };
}

Three things this gets right:

Invalid regex (unclosed paren, etc.) is caught and surfaced as an error string in the UI — better than the whole tool silently breaking.
Capture groups are collected via m.slice(1). The user can drop group-open and group-close tokens and capture-extraction works automatically.
Empty token list returns ok: true with matches: [] — no error, just nothing to match against. Lets the UI render cleanly on first paint.
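
The two branches are easiest to see side by side. The sketch below is a simplified variant that takes a pattern string instead of tokens (matchPattern is a hypothetical name, and it leaves out the regex field and the empty-pattern check that tryMatch has), so it runs on its own:

```javascript
function matchPattern(pattern, flags, text) {
  let regex;
  try {
    regex = new RegExp(pattern, flags);
  } catch (e) {
    return { ok: false, error: e.message };
  }
  // matchAll needs g; match() without g yields one result array or null.
  const hits = regex.global
    ? [...text.matchAll(regex)]
    : [text.match(regex)].filter(Boolean);
  const matches = hits.map((m) => ({
    start: m.index,
    end: m.index + m[0].length,
    text: m[0],
    groups: m.slice(1),
  }));
  return { ok: true, matches };
}

const sample = "#12 and #345";
const all = matchPattern("#(\\d+)", "g", sample);
const first = matchPattern("#(\\d+)", "", sample);

console.log(all.matches.map((m) => m.groups[0]).join()); // 12,345
console.log(first.matches.length);                       // 1
console.log(matchPattern("(", "", sample).ok);           // false

// Calling matchAll directly with a non-global regex throws.
let threw = false;
try {
  sample.matchAll(/\d+/);
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
```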

Architecture

tokens.js   ← Token catalog + compile + tryMatch + segmentText + escapeForRegex
app.js      ← UI glue: catalog rendering, sequence chips, flag sync, live update

The dependency arrow goes app.js → tokens.js only. tokens.js touches neither document nor window. new RegExp() behaves the same in the browser and in Node, so tryMatch is testable too, not just the string compilation:

npm test  # 26 tests, 173ms

The test breakdown:

escapeForRegex (3): metachar coverage, passthrough, backslash doubling
compile (8): empty, single token, literal escape, char class, negated class, exact / range quantifiers, email-ish, group + alternation
tryMatch (6): success, non-global limits, invalid regex, empty tokens, case-insensitive flag, capture groups
segmentText (4): no matches, match at start, match at end, gaps
catalog (3): unique IDs, has label + category, getTokenDef

To add a new token, write one or two tests against compile / tryMatch, then add an entry to TOKEN_CATALOG.
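
As a sketch of that workflow, here is a made-up q-at-least token added test-first. The id, the mini-catalog, and the find-based lookup are assumptions for illustration; the real catalog may already cover this case under a different name:

```javascript
// Hypothetical mini-catalog standing in for TOKEN_CATALOG.
const catalog = [{ id: "digit", pattern: "\\d" }];

const compile = (tokens) =>
  tokens.map((t) => {
    const def = catalog.find((d) => d.id === t.id);
    if (!def) return "";
    return def.paramFn ? def.paramFn(t.value) : def.pattern ?? "";
  }).join("");

// Step 1: write the expectation before the token exists.
const wanted = [{ id: "digit" }, { id: "q-at-least", value: 3 }];
const before = compile(wanted);
console.log(before); // \d  (the unknown id compiles to "")

// Step 2: add the catalog entry that satisfies it.
catalog.push({ id: "q-at-least", paramFn: (n) => `{${n},}` });
const after = compile(wanted);
console.log(after);  // \d{3,}
console.log(new RegExp(`^${after}$`).test("1234")); // true
```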

Try it

Demo: https://sen.ltd/portfolio/regex-builder/
GitHub: https://github.com/sen-ltd/regex-builder

The default state shows \d+ matching every number in a sample order line. Pop a literal "#" in front and you've narrowed it to order numbers. Wrap the whole thing in capturing parens to pull the digits out. The UI helps the regex make sense in steps.

Takeaways

A regex is a concatenation of tokens. That mental model fits all 21 tokens (character classes, quantifiers, anchors, groups) into one shape.
Literal tokens auto-escape regex metacharacters, so the UI doesn't ask users to remember that . is special.
Quantifiers are positional, not modifiers — they work because the regex engine itself reads them that way.
segmentText is the pattern for any "highlight matches in text" feature: split into alternating matched/unmatched segments, then render. Unit-test it.
Wrap new RegExp in try/catch and surface errors as UI text. Invalid regex shouldn't break the whole tool.

This is OSS portfolio #245 from SEN LLC (Tokyo). We ship small, sharp tools continuously: https://sen.ltd/portfolio/
