Discussion on: My favourite computer science text (OR "how regex works")

View post

It would be nice to know the exact limitations of Regex. Like, why cannot it parse XML?

The article doesn't discuss much of the theory behind regex, but it does talk about DFA and NFAs, which are finite automata, and how regex can be turned into either, and how NFAs and DFAs can be turned into regex. The fact that they can be converted to each other means they are capable of recognizing exactly the same languages, known as regular languages.

Now what's important about "finite automata" is that they have a finite amount of states. Each letter parsed is equivalent to a state transition. Let's say you had a regex capable of recognizing all valid XML. If the corresponding finite automata has N states, and I give N+1 starting tags <mytag>, then XML requires N+1 </mytag>, but the finite automata can at most distinguish between N different amounts of starting tags. That means it won't be capable of counting down the N+1 ending tags that are required. It just doesn't have enough different states to track the progress from N+1 to 0. This argument holds for any regex you could ever try to recognize XML with. XML simply isn't a regular language. My argument closely resembles what's called the pumping lemma for regular languages.