Regex - What I Learned Trying to Parse HTML

Until now, I've used regex sparingly, mostly for extremely simple captures. Lately, I've been working on a personal project in my spare time that involves parsing and editing HTML files. I could probably have found a library with the functionality I needed, but I figured it would be a good opportunity to become more comfortable with regex.

I'll demonstrate a few of the issues I had along the way, and explain the solution to each. First, I'd like to recommend regexr to anyone who isn't already familiar with it. This was an incredibly valuable resource to have while I worked toward a solution.

The first thing I needed was a list of the opening and closing tags, including all of their information: type, attributes, inner HTML, and indentation. This turned out to be the most difficult problem to solve.

The following is an approximate example of my first attempt at a solution.

/<.*?>[^<]*/g

I'll try to explain what each part of this expression is doing.

< - Start a match at <
.* - Match zero or more characters, excluding line breaks
? - Make the preceding quantifier non-greedy, so it matches as few characters as possible and stops at the first > rather than the last
> - Next character to look for is >
[^<]* - Include zero or more characters that are not <, line breaks included

This almost works. However, a problem arises when you get to a script element containing code that uses the less-than operator: the match stops at that operator, and we lose the remaining code.
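To make both behaviours concrete, here is a minimal sketch that runs the expression against two small samples. The sample markup and the variable names are made up for illustration and are not taken from my project.

```javascript
// First-attempt pattern from above.
const firstAttempt = /<.*?>[^<]*/g;

// Hypothetical sample markup, for illustration only.
const simple = '<div class="a">\n  <p>Hello</p>\n</div>';
const simpleMatches = simple.match(firstAttempt);
console.log(simpleMatches);
// Each tag is captured together with the text that follows it,
// so joining the matches rebuilds the input.
console.log(simpleMatches.join('') === simple);

// A script body that uses the less-than operator.
const script = '<script>\n  if (a < b) { run(); }\n</script>';
const scriptMatches = script.match(firstAttempt);
console.log(scriptMatches);
// [^<]* stops at the operator, and "< b) { run(); }" can never match
// <.*?> (there is no > on that line), so that code is missing here.
console.log(scriptMatches.join('') === script);
```

In the second sample only the opening tag with the text up to the operator, and the closing tag, come back; everything from the operator to the end of the script body is dropped.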

After another half-hour or so spent looking for a solution, I stumbled across regexr and discovered the positive lookahead. A positive lookahead lets you specify a group that must come next for the match to end there, without including that group in the match.

/<.*?>[\s\S]*?(?=<.*>)/g

Again, I'll try to explain.

<.*?> - Same as above
[\s\S]*? - Match any characters, line breaks included, again non-greedily
(?=<.*>) - Stop before you reach a set that matches <.*>
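A quick way to see the lookahead at work is to run it on a script body that broke the first attempt. This is only a sketch; the sample string and names are made up for illustration.

```javascript
// Second attempt: lazily consume text until something tag-like is ahead.
const withLookahead = /<.*?>[\s\S]*?(?=<.*>)/g;

// Hypothetical sample: the operator has no > after it on the same line.
const sample = '<script>\n  if (a < b) { run(); }\n</script>';
const matches = sample.match(withLookahead);
console.log(matches);
// The lone < is skipped because <.*> cannot match there, so the whole
// script body stays in one match. The lookahead consumes nothing, which
// leaves the closing tag for the next match attempt. That attempt fails
// here, because no further tag follows </script>.
```

Note that the final closing tag is not part of the result: the expression always requires another tag to follow.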

I really thought that I had it this time, but there are always more edge cases. Many of you can probably already see my mistake; maybe I should have taken a break at this point, but I was determined to resolve this.

Let's take a look at the first offender I ran into.

<script>
    if (var1 < 10 && var2 > 10) {
        ...
    }
</script>
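Running the lookahead expression against that snippet shows the failure. This is a rough sketch: the elided body is kept as a comment line, and the variable names are my own.

```javascript
// Same lookahead pattern as above.
const withLookahead = /<.*?>[\s\S]*?(?=<.*>)/g;

// The offending snippet, with the elided body left as a comment.
const offender =
  '<script>\n' +
  '    if (var1 < 10 && var2 > 10) {\n' +
  '        // ...\n' +
  '    }\n' +
  '</script>';

const offenderMatches = offender.match(withLookahead);
console.log(offenderMatches);
// "< 10 && var2 >" satisfies <.*>, so the first match ends right before
// the operator and the second match starts with the comparison itself,
// as if it were a tag.
console.log(offenderMatches[1].startsWith('< 10 && var2 >'));
```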

After another extended search for a solution, I came up with the following.

/<.*?>[\s\S]*?(?=<\/?\w+.*>)/g

As usual, an explanation. This time, I'll just describe what was added to the positive lookahead.

\/? - Check for zero or one forward slash (escaped, because an unescaped / would end the regex literal)
\w+ - Check for one or more word characters

It was at this point that I realized I would need to explicitly add each element name to the lookahead, and even then there would probably be issues. For example, the following is valid JavaScript, and <a||a> looks just like a tag to this expression.

let a = 5;
if (5<a||a>10) {
    ...
}
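The sketch below runs the tag-name lookahead on both styles of comparison. The forward slash is escaped so the pattern is a valid JavaScript regex literal, and the sample strings (with empty braces standing in for the elided bodies) are made up for illustration.

```javascript
// Third attempt; the slash is escaped so the regex literal stays valid.
const tagLookahead = /<.*?>[\s\S]*?(?=<\/?\w+.*>)/g;

// Spaced comparison: "< 10" no longer looks like a tag, because \w+
// requires a word character straight after the <.
const spaced = '<script>\n  if (var1 < 10 && var2 > 10) {}\n</script>';
const spacedMatches = spaced.match(tagLookahead);
console.log(spacedMatches);

// Same kind of comparison without spaces: <a||a> satisfies <\w+.*>.
const compact = '<script>\nlet a = 5;\nif (5<a||a>10) {}\n</script>';
const compactMatches = compact.match(tagLookahead);
console.log(compactMatches);
```

The spaced version comes back as a single match holding the whole script body, while the compact version is split in two at the comparison.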

Another problem I ran into was that less-than and greater-than characters may appear within a string in the code.

console.log("<script></script>");

Out of curiosity, I did a little more research, and attempted to resolve this problem. Time for the negative lookbehind!

/<.*?>[\s\S]*?(?=(?<!"[\s\w]*|"[\s\w]*<.*>[\s\w]*)<\/?\w+.*?>)/g

Let's see if I can try to describe what's happening here. Again, I'll only focus on what was added to the positive lookahead, which happens to be a negative lookbehind.

(?<!...) - This is the negative lookbehind: if the text immediately before the current position matches the pattern inside it, the match fails at that position, so the tag that follows is not treated as a stopping point.
"[\s\w]* - A " followed by zero or more whitespace or word characters
| - OR
"[\s\w]*<.*>[\s\w]* - Same as above, but check for a preceding element tag, with whitespace and word characters in mind.

With the exception of adding an OR case to the end of the expression to catch the last closing tag, this is as far as I've made it. It's extremely error-prone, and I'll definitely need to formulate a better solution. I am, however, glad that I tried, as I was able to learn quite a bit.
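For reference, here is a minimal sketch of the lookbehind expression in action, without the OR case for the last closing tag. It assumes a JavaScript engine that supports lookbehind (recent versions of Node.js and current browsers do), the forward slash is escaped so the literal is valid, and both sample strings are made up for illustration.

```javascript
// Lookbehind pattern from above, with the slash escaped.
const withLookbehind =
  /<.*?>[\s\S]*?(?=(?<!"[\s\w]*|"[\s\w]*<.*>[\s\w]*)<\/?\w+.*?>)/g;

// Tags that sit directly inside a string literal are skipped.
const quoted = '<script>\nconsole.log("<script></script>");\n</script>';
const quotedMatches = quoted.match(withLookbehind);
console.log(quotedMatches);

// Hypothetical counterexample: the colon is neither whitespace nor a
// word character, so the lookbehind does not apply and <b> is treated
// as a tag again.
const punctuated = '<script>\nlog("x: <b>");\n</script>';
const punctuatedMatches = punctuated.match(withLookbehind);
console.log(punctuatedMatches);
```

The first sample stays in one piece, while the second is split at the <b> inside the string, which is one example of how fragile the expression still is.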

Latest comments (2)

Tobias Nickel • Oct 25 '20

Interesting thought. I also have an XML/HTML parser on npm, and I immediately had to test whether it works correctly with your script tag example. It works, because I handle the script tag separately. The script tag could even contain comments with XML inside. I will add your example to its test cases.

When I needed an XML parser (in the browser and in a worker), I also looked into regex, but regexes cannot express that the opening and closing tags need to have the same name.

After some tests with EBNF, I just parsed XML strings with plain JS,...

Klaus Baldermann • Oct 25 '20

About regexes for HTML parsing, I like this from my link collection