Natural Language v. Regex: The Context Wars

#regex #machinelearning #datascience

The past decade has seen AI advance in leaps and bounds, with Deep Learning (DL) enabling many new applications and redefining industries. Computer Vision has had a particularly significant impact: prior to DL, computers could not process real-world images and videos accurately enough to enable practical applications. Natural Language Processing (NLP) is different in that, prior to Deep Learning, relatively effective techniques did exist, enabling applications such as spam filtering.

Enter the humble regular expression (regex), a method used to specify patterns to look for within text. Regexes have formed an integral part of computer software for decades, especially in Named Entity Recognition (NER) systems. NER applications include gathering keywords for search indexing, product intelligence, and (the application that Private AI works on) identifying sensitive information.

Say you're looking for credit card numbers: you can look for 16-digit numbers, or for 4 groups of 4 digits separated by ‘-’. This works well in the perfect world of computers, but unfortunately the real world is much more complicated. Let's take phone numbers. They come in different lengths and arrangements, and the way they are written changes depending on where the caller is located. The following numbers are all equivalent, for instance:

00049 153 43437 800 – Extra 0 to dial out of an office phone system
+49 153 434 37 800 – ‘+’ international dialing code format
01534 347800 – ‘Standard’ intra-country format
15 343 478 00 – Shorthand: people often drop the leading digit in their own locale
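
To see how quickly a single pattern falls short, the sketch below runs one illustrative regex, written only for the ‘+’ international format, over the four spellings above. The pattern and the names `INTL_FORMAT`, `VARIANTS` and `matches_intl` are assumptions made for this example, not a recommended phone-number matcher.

```python
import re

# Illustrative pattern (an assumption for this sketch):
# '+', a 1-3 digit country code, then one or more space-separated digit groups.
INTL_FORMAT = re.compile(r"\+\d{1,3}(?: \d+)+")

# The four spellings listed in the article.
VARIANTS = [
    "00049 153 43437 800",
    "+49 153 434 37 800",
    "01534 347800",
    "15 343 478 00",
]


def matches_intl(text: str) -> bool:
    """Return True if the whole string fits the '+' international pattern."""
    return INTL_FORMAT.fullmatch(text) is not None


for number in VARIANTS:
    print(f"{number!r}: {matches_intl(number)}")
```

Only the second spelling matches. Covering the other three means adding more alternatives to the pattern, and each one widens the set of unrelated digit strings that will also match.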

Phone numbers can also contain letters. For example, people frequently write their internal office number using just their extension, typically starting with ‘x’, e.g. ‘x51399’. Hotlines are another good example, often using the letters associated with each digit on a phone keypad to make the number easier to remember. For example, a taxi hotline in Australia writes ‘132227’ as ‘13cabs’.
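
The keypad mapping behind such vanity numbers is easy to write down, which is a reminder that the hard part is recognising the number in the first place. Below is a minimal sketch using the standard telephone keypad letter groups; the names `KEYPAD`, `LETTER_TO_DIGIT` and `vanity_to_digits` are made up for this example.

```python
# Standard telephone keypad letter groups (the common ITU-T E.161 layout).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}

# Invert the table so each letter points at its digit.
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def vanity_to_digits(number: str) -> str:
    """Replace each keypad letter with its digit; leave other characters unchanged."""
    return "".join(LETTER_TO_DIGIT.get(ch.lower(), ch) for ch in number)


print(vanity_to_digits("13cabs"))  # 132227
```

Note that an extension such as ‘x51399’ cannot be run through the same function: its ‘x’ is a marker rather than a keypad letter, so it would wrongly be turned into a 9.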

The same digits can also mean different things. For example, ‘911’ could be the emergency number, the Porsche 911, or a reference to the 9/11 attacks. Replacement part numbers that happen to look like phone numbers are another good example.

Google actually built a system for finding phone numbers using regular expressions. The system was manually programmed to find numbers and check whether they are valid for each country in the world. It works well for regular numbers and is even able to catch the above example with an extra 0 for office systems. It is not, however, able to find alphanumeric numbers such as ‘x5177’ or ‘13cabs’.

As you can see, it’s usually not feasible to program all of these patterns in. Beyond the costly developer time needed to write and optimize them, such a system would produce so many false matches that its output would no longer be useful. Regex-based solutions also require a lot of maintenance, as the expressions constantly need to be tweaked to account for new patterns and false matches. This happens to be one of the least favourite developer chores out there.

Today’s highly connected world requires international solutions that deal with regional differences in standardization. For example, a German address could be ‘Eugen-Schoenhaar-strasse 21, 10423 Berlin’. The house number is written after the street name instead of before it, and ‘strasse’ (street in German) is joined to the street name in one long compound word, because Germans love those. Postal codes are another good example. Dutch and Canadian postal codes are both made up of 6 characters. A Dutch postal code, however, is 4 digits followed by 2 letters (e.g. 1234AY), whereas a Canadian postal code alternates letters and digits, starting with a letter (e.g. K1A 0B1). An Australian postal code, on the other hand, is only 4 digits.
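
The postal code shapes just described can be written as one pattern per country. This is a rough sketch with deliberately simplified patterns (real postal rules exclude certain letters and ranges); `POSTAL_PATTERNS` and `candidate_countries` are hypothetical names for this example.

```python
import re

# Simplified shape checks only; real postal rules are stricter than this.
POSTAL_PATTERNS = {
    "NL": re.compile(r"\d{4} ?[A-Z]{2}"),          # 4 digits followed by 2 letters
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),  # letter and digit alternating
    "AU": re.compile(r"\d{4}"),                    # 4 digits
}


def candidate_countries(code: str) -> list[str]:
    """Return every country whose simplified pattern matches the whole code."""
    normalized = code.strip().upper()
    return [
        country
        for country, pattern in POSTAL_PATTERNS.items()
        if pattern.fullmatch(normalized)
    ]


for example in ["1234AY", "K1A 0B1", "2000", "10423", "2024"]:
    print(example, candidate_countries(example))
```

The German code ‘10423’ matches nothing until another pattern is added, while a bare year such as ‘2024’ passes the Australian check. Every country added means more patterns and more room for false matches.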

In summary, the real world is a very complicated place, with even simple problems containing many different variations and edge cases. Building a system that performs reliably and that delivers a level of accuracy high enough for production applications requires handling each of these variants and edge cases, such as phone numbers containing letters. And that credit card number example? Well, it turns out not all credit cards are 16 digits long!
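
The credit card rule from earlier (16 digits in a row, or 4 groups of 4 digits separated by ‘-’) can be written as a single regex. The sketch below tries it on two widely published test card numbers, not real cards; the names `CARD_16` and `find_cards` are assumptions for this example.

```python
import re

# 16 digits in a row, or 4 groups of 4 digits separated by '-'.
CARD_16 = re.compile(r"\b(?:\d{16}|\d{4}(?:-\d{4}){3})\b")


def find_cards(text: str) -> list[str]:
    """Return every substring that fits the 16-digit rule."""
    return CARD_16.findall(text)


print(find_cards("Visa test card: 4111111111111111"))
print(find_cards("Same card, grouped: 4111-1111-1111-1111"))
print(find_cards("Amex test card: 378282246310005"))  # 15 digits
```

The first two calls each return the card number, while the 15-digit American Express test number is missed entirely.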

AI Systems

As humans, we can easily distinguish between the above examples to determine what is or is not a phone number, an address, etc. We do this by looking at the number, but also at the context. Compare “call 911!” with “gosh that's a nice 911”. Unlike regexes, AI models take context into account and are therefore able to understand text more like humans do.
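
A pattern that only looks at the characters themselves cannot tell these two sentences apart. As a small illustration (the pattern and variable names are assumptions for this sketch):

```python
import re

# Match '911' as a standalone token.
EMERGENCY = re.compile(r"\b911\b")

sentences = ["call 911!", "gosh that's a nice 911"]

# Record whether the pattern fires on each sentence.
hits = [bool(EMERGENCY.search(sentence)) for sentence in sentences]

for sentence, hit in zip(sentences, hits):
    print(f"{sentence!r}: {hit}")
```

Both sentences match. Telling them apart would mean encoding the surrounding words into the pattern, which is exactly the kind of hand-written context rule that becomes hard to maintain.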

State-of-the-art AI systems, like Private AI's, are trained on large amounts of carefully annotated data and meticulously revised to account for all these edge cases and locale-specific differences. In Private AI's case, we reach >97% in-domain accuracy, which we have found to be higher than human performance in some settings.

AI systems also scale better and are easier to maintain. Integrating a change into a regex-based system requires careful analysis of the existing expressions, which become harder and harder to comprehend as one adds to them, to make sure that the change doesn’t affect existing expressions and adequately captures all the possible permutations of a term. It is common to ‘fix one thing, break another’, and it often takes a few iterations in production before a change is successfully made. All AI-based systems require, on the other hand, is extra training data.
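
The ‘fix one thing, break another’ effect is easy to reproduce. In this sketch, a phone pattern is relaxed to accept a new separator style and immediately starts matching an order number. The patterns, the sample sentences and the helper `wrong_answers` are all invented for illustration.

```python
import re

# Version 1: only the dashed form NNN-NNN-NNNN.
PHONE_V1 = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Version 2: separators relaxed to '-', '.', a space, or nothing at all.
PHONE_V2 = re.compile(r"\b\d{3}[-. ]?\d{3}[-. ]?\d{4}\b")

# Sample text mapped to whether it really contains a phone number.
SAMPLES = {
    "Call 416-555-0199 today": True,
    "Call 416 555 0199 today": True,
    "Order 1234567890 has shipped": False,
}


def wrong_answers(pattern: re.Pattern) -> list[str]:
    """Return the samples on which the pattern disagrees with the label."""
    return [
        text
        for text, expected in SAMPLES.items()
        if bool(pattern.search(text)) != expected
    ]


print("v1 gets wrong:", wrong_answers(PHONE_V1))
print("v2 gets wrong:", wrong_answers(PHONE_V2))
```

Version 1 misses the space-separated number. Version 2 fixes that but now flags the order number, so the total number of mistakes has not gone down.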

A key drawback of AI-based systems in the past has been the need for a large amount of data to train the AI model. Modern techniques, however, allow AI-based systems to be developed with a fraction of the training data that was previously required. Private AI’s solution in particular can learn to generalize well from as few as 10 examples.

AI-based systems also typically require a tremendous amount of computing power – far more than regex-based systems. This can lead to large cloud bills or to difficulty integrating models into edge applications such as mobile apps or desktop applications. AI is also rapidly advancing in this area, with the latest techniques allowing for large reductions in compute resources. At Private AI, we have spent a large amount of time optimizing our solution to the point where it is now 25x faster than BERT large, a popular NLP architecture, whilst also surpassing its performance.

Summary

Regexes have served as an integral part of computing systems for decades, and will continue to do so. AI-based systems, however, offer new levels of performance by understanding context similarly to humans. For unstructured real-world tasks like detecting phone numbers in text, these new techniques enable a wide range of new, high-quality production applications.

DEV Community
