DEV Community

Cover image for How a 14-character regex destroyed every comparison table in my AI content pipeline
Rami Mamar
Rami Mamar

Posted on • Originally published at seohive.io

How a 14-character regex destroyed every comparison table in my AI content pipeline

I run an AI content pipeline that publishes 30 SEO articles a month. It has six phases: outline, write, fact-check, tone-QA, internal-link rewire, validate-and-repair. Every published article gets schema, FAQ, citations, the works.

Last week I shipped 14 articles in one batch. Two of them had GFM markdown tables (the | Tool | Pricing | Trade-off | kind). When the articles landed on the live site, the tables rendered as visual rubble:

| Perplexity   | Real-time research | Free + Pro     |
|, - |, - |, - |
| ChatGPT      | Long-form synthesis | Free + Plus    |
Enter fullscreen mode Exit fullscreen mode

Look at the second row. That's supposed to be a GFM separator: | --- | --- | --- |. Instead it's |, - |, - |, - |.

The bug had been silently corrupting every comparison-table article for as long as the pipeline had been running. I just hadn't noticed because most articles don't emit tables.

The fix is one line. The lesson is bigger.

What actually happened

My pipeline has a phase called "tone-QA." It runs a deterministic regex scrubber against the draft before the LLM rewrite stage. The scrubber strips classic AI tells: em-dashes used as commas, "in today's landscape," "leverage," "robust," "delve into."

The em-dash rule looks like this:

{ pattern: /[ \t]*[–—][ \t]*/g, replacement: ', ', flag: 'em/en-dash → comma' }
Enter fullscreen mode Exit fullscreen mode

That one is fine. It only matches actual em-dashes (, U+2014) or en-dashes (, U+2013).

The bug is in the next rule:

{ pattern: /[ \t]*--[ \t]*/g, replacement: ', ', flag: 'double-hyphen → comma' }
Enter fullscreen mode Exit fullscreen mode

Double-hyphen (--) is a common ASCII em-dash stand-in. The regex matches -- with optional whitespace on either side, replaces with ,.

Now consider the GFM table separator row:

| --- | --- | --- |
Enter fullscreen mode Exit fullscreen mode

Each --- is three hyphens. The regex doesn't ask if the -- is followed by another -. It just greedily matches the first two.

Trace it on | --- |:

  • Position 0: | — no match (regex needs -- somewhere).
  • Position 1: space — [ \t]* consumes it (1 char). Then -- matches positions 2-3. Then [ \t]* tries position 4 — that's -, not whitespace, so it matches zero chars. Total match span: positions 1-3, which is " --". Replace with ", ". Result: |, - |.

The trailing dash from --- stays. The leading two dashes get eaten. Multiply by five separator cells across a row, and the table separator becomes:

|, - |, - |, - |, - |, - |
Enter fullscreen mode Exit fullscreen mode

Markdown parsers see no separator. The whole table renders as broken paragraph text.

How I caught it (eventually)

I caught it via a manual-LLM testing mode I'd added to the pipeline a session earlier. The idea: don't actually call any LLM API; instead write each phase's prompt to a JSON file, hand-author the response, drop it into the cache, and re-run. Lets you debug the full pipeline without burning tokens.

When I ran one article through this manual mode, I saw the tone-QA phase's INPUT had a clean table separator (| --- | --- |) and its OUTPUT had the corrupted one (|, - |, - |).

Between input and output there's only one thing: the deterministic regex scrubber. The LLM gets called after, but the corruption was already done.

Diff between two cached files in the cache directory. Five seconds to find the line.

The fix

Negative lookbehind + lookahead. Don't match -- if it's part of a --- run.

// Was
{ pattern: /[ \t]*--[ \t]*/g, replacement: ', ', flag: 'double-hyphen → comma' }

// Now
{ pattern: /(?<!-)[ \t]*--(?!-)[ \t]*/g, replacement: ', ', flag: 'double-hyphen → comma' }
Enter fullscreen mode Exit fullscreen mode

Verification cases:

Input Old output New output
`\ --- \ --- \
{% raw %}before---after before, -after before---after
`\ ----\ ` (4 dashes)
word--word (real em-dash use) word, word word, word
word -- word word, word word, word

The lookbehind says "don't match if the previous character is a dash." The lookahead says "don't match if the next character is a dash." So -- only fires when it's standalone (or surrounded by whitespace), never inside a longer hyphen run.

Tables survive. Real em-dash stand-ins still get scrubbed. Triple-dash separators (and quadruple, and the entire --- family) are now safe.

Three lessons I'm taking forward

1. Markdown is not a string

Markdown contains structural delimiters that look like prose. My scrubber treated --- as ASCII punctuation when it was actually a table separator. The fix isn't just my one regex; it's a category. Anywhere the pipeline regex-mutates markdown, I now mentally check: could this match a delimiter token?

Similar landmines I added test cases for after this bug:

  • **bold** (asterisks used as emphasis)
  • `code` (backticks)
  • [link](url) (the URL might contain characters my scrubber strips)
  • > blockquote (the leading >)
  • # heading (the leading #)

The principle: regex on markdown is regex on a parser input, not on prose.

2. Multi-step pipelines need step-level diffs

I had been using the pipeline for months. The bug existed for months. I caught it in 90 seconds when I could inspect the input and output of each phase as separate files.

Black-box pipelines hide their failures. The fix for that is mechanical: dump every step's I/O to disk when running in debug mode. The whole "manual LLM cache" pattern I'd built for cost reasons turned out to be a debugging multiplier I hadn't expected. A regex scrubbing bug between phase 4 and phase 5 was invisible from the published output but completely obvious in the on-disk diff.

If you have any AI pipeline with more than two phases, build the I/O snapshot now. You'll find a bug.

3. Visual output requires visual checking

Every other phase in the pipeline (factcheck, tone-QA structural rules, citability scoring) was producing telemetry numbers. Word counts. Citation match counts. Score breakdowns. The table corruption produced no numeric signal: the article still hit its word count, still had headings, still had a "key takeaways" block. Nothing flagged.

Now my pipeline includes a structural-presence check: if a brief's tableSpec was set, the final markdown is checked for a GFM table (| ... | line followed by | --- | separator). The validator surfaces missingRequiredTable: true if it's absent.

Three months of writing tables, none of them present in the published HTML, zero alerts. Visual outputs need visual checks.

The takeaway

The bug is a one-liner. The lesson is that every regex I touch from now on lives next to a test case for the delimiters it could accidentally match.

If you run an AI content pipeline of any kind, search your codebase for --, em-dash, , &mdash;. Make sure none of the scrubbers walks into a markdown table separator. You'll find one. Promise.


I run SeoHive, a productized AI SEO platform. The bug was real, the articles are live with the fix in place, and the pipeline now publishes 30 articles a month without eating its own tables.


Top comments (0)