Intro to Regular Expressions

#regex #programming #beginners #webdev

What are Regular Expressions?

Regular Expressions, or RegEx for short, are patterns written to describe and find specific sequences of characters in text. They can be used for things such as validation (think email address validation) or for searching text for anything that fits a given pattern.
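To make those two use cases concrete, here is a minimal Ruby sketch. The email pattern is a deliberately loose shape check that I made up for illustration, not a real email validator, and the helper names are hypothetical.

```ruby
# Loose shape check: something@something.something
# (an illustrative assumption, not a complete email validator).
EMAIL_SHAPE = /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/

# Validation: does the whole string fit the pattern?
def looks_like_email?(text)
  EMAIL_SHAPE.match?(text)
end

# Searching: scan returns every substring that fits the pattern,
# here any standalone run of exactly four digits.
def find_years(text)
  text.scan(/\b\d{4}\b/)
end

p looks_like_email?("jane@example.com")               # true
p looks_like_email?("not an email")                   # false
p find_years("Founded in 1999, relaunched in 2015.")  # ["1999", "2015"]
```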

Why RegEx?

When I was going through phase 3, learning Ruby at Flatiron School, we had a very small section based on RegEx. At first, it was really hard to wrap my mind around. The syntax, of course, is not nearly as clean as reading something like Ruby. It was something I felt like I could spend a week on. However, there are a lot of use cases for RegEx, and it can be a powerful tool when wielded wisely.

Interestingly, shortly after I read about RegEx for the first time at Flatiron, it actually came in handy at work. The first example I saw of it being utilized was when watching one of our engineers search for something specific in a large block of code. He used RegEx to find exactly what he was looking for, and he found it quickly.

The second example was actually something I was able to use personally. I work in Sales Operations, and we recently acquired a piece of software that we can use to help de-dupe our Salesforce instance. To do so, I had to set up the rules used to match two Leads as duplicates.

The problem I was running into was that we wouldn't always collect clean data on our Leads. For example, if someone were to go through our Lead-capturing flow on our website, we ask them for a lot of data in a step-by-step manner, but we don't ask for their name or company name up front. We gather the important data first, such as their email address and the data that qualifies them to do business with us.

The result is that a lot of Leads come in with a first name, last name, and/or company name of "unknown." When trying to deduplicate these Leads, we'd like to be able to fuzzy match first names, last names, and company names, but we can't effectively do that if a good chunk of our database has "unknown" in those fields.

Enter RegEx.

In our deduplication tool, we have the option of using RegEx as a field in our matching rules, which are used to match two Leads as duplicates. We don't have the option to simply exclude specific strings in the first name, last name, or company name fields, but we do have the ability to use a regular expression. With RegEx, I can very easily weed out any first names, last names, or company names that are "unknown" or "test." The pattern for this is as follows:
/\b(?!\b(?:unknown|Unknown|test|Test)\b)\w+\b/

What this does is keep Leads from being matched as duplicates on the words "unknown," "Unknown," "test," and "Test." The pattern matches every other word and skips those four. You can see this in effect by using Rubular.

If you paste a paragraph of text into Rubular with this pattern, the only words that are excluded are "unknown," "Unknown," "test," and "Test." Everything else is selected. Note that instances of "test" and "unknown" that are attached to other words are not excluded, which is by design. We only want to remove exact matches.
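If you'd rather try this in Ruby than in Rubular, here is a small sketch. It assumes the four alternatives are wrapped in a non-capturing group, "(?:...)", so that the closing inner boundary applies to every word in the list. The sample sentence and the matchable? helper are made up for illustration; I don't know how the deduplication tool applies the pattern internally.

```ruby
# The alternatives are grouped with (?:...) so the closing \b applies
# to all four words, which limits the exclusion to whole words.
KEEP_WORD = /\b(?!\b(?:unknown|Unknown|test|Test)\b)\w+\b/

SAMPLE = "Unknown lead from a test contest with unknowns and Testing"

# scan returns every word the pattern matches, i.e. the words we keep.
KEPT = SAMPLE.scan(KEEP_WORD)
p KEPT
# ["lead", "from", "a", "contest", "with", "unknowns", "and", "Testing"]

# Hypothetical helper: a field value is usable for matching only if
# the pattern finds at least one word in it.
def matchable?(value)
  KEEP_WORD.match?(value)
end

p matchable?("unknown")  # false
p matchable?("Acme")     # true
```

"Unknown" and "test" are dropped, while "contest," "unknowns," and "Testing" survive because they only contain the excluded words as part of a longer word.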

Let's break down this code.

In Ruby (and on Rubular), a RegEx literal is written between slashes, which is why you see those at the beginning and end. The slashes are delimiters and not part of the pattern itself.
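A quick way to convince yourself that the slashes only mark where the pattern starts and stops is to build the same pattern from a string. This is a small sketch with a made-up pattern:

```ruby
# The same pattern written as a literal and built from a string.
LITERAL = /\btest\b/
BUILT   = Regexp.new('\btest\b')

# source returns the pattern text without the surrounding slashes.
puts LITERAL.source                   # \btest\b
p LITERAL.source == BUILT.source      # true
p BUILT.match?("this is a test")      # true
```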

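Next, the "\b" is a word boundary. It matches the position between a word character and a non-word character (or the start or end of the text) without consuming any characters itself. The one at the beginning and the one at the very end mean that whatever is matched between them is a whole word of its own. Note that there's a set of inner boundaries inside of this one.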
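Here is a small sketch of what a word boundary does, using a made-up pattern and sample strings. The last line marks every boundary position with a pipe character so you can see that a boundary is a position, not a character.

```ruby
WHOLE_WORD = /\btest\b/

def whole_word_test?(text)
  WHOLE_WORD.match?(text)
end

p whole_word_test?("a test lead")    # true
p whole_word_test?("contest entry")  # false, "test" is glued to "con"
p whole_word_test?("testing 123")    # false, "test" is glued to "ing"

# Insert a "|" at every boundary position.
MARKED = "a test".gsub(/\b/, "|")
puts MARKED                          # |a| |test|
```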

Next, you have the opening and closing parentheses. These say that everything contained within them is evaluated together as a group.

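After that, you have the "?!" which, placed right after the opening parenthesis, turns the group into a negative look-ahead. It checks the text that follows the current position and only lets the match continue if the pattern inside the group does not match there. It doesn't consume any characters. It only flips the result of that check from truthy to falsey.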
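A negative look-ahead on its own can behave in a surprising way, which is part of why the outer word boundary matters. In this sketch (made-up patterns, not the one from my matching rule), the first pattern has no boundary in front of the look-ahead, so the engine simply slides forward one character and matches the rest of the word.

```ruby
# No boundary: the look-ahead fails at "t", but succeeds one step later.
SLIDING  = /(?!test)\w+/
# With a boundary, the match can only start at the beginning of a word.
ANCHORED = /\b(?!test)\w+/

p "test"[SLIDING]    # "est"
p "test"[ANCHORED]   # nil
p "team"[ANCHORED]   # "team"
```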

Next, you have the aforementioned inner word boundaries. The opening one sits at the same position as the outer boundary, so it is redundant but harmless. The closing one comes after all the words in our or statement and requires the excluded word to end right there, which is what limits the exclusion to whole words.

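After that are the words that we're actually searching for. The first is "unknown," but you may notice a familiar operator after it: the vertical pipe. In RegEx, it means the same thing as it does in a lot of coding languages: it's an "or" statement. So this part of the expression is saying that we're searching for "unknown," "Unknown," "test," or "Test." One thing to watch for is that the pipe has the lowest precedence in RegEx, so a boundary written next to a list of alternatives only belongs to the alternative it touches unless the whole list is wrapped in its own group, such as "(?:...)".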
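Because the pipe binds more loosely than anything else in a pattern, grouping changes which alternatives a boundary applies to. Here is a small sketch with two made-up patterns that differ only in the grouping:

```ruby
# Reads as: (\bunknown) OR (test\b)
UNGROUPED = /\bunknown|test\b/
# Reads as: \b, then (unknown OR test), then \b
GROUPED   = /\b(?:unknown|test)\b/

p "unknowns".match?(UNGROUPED)  # true, the word starts with "unknown"
p "contest".match?(UNGROUPED)   # true, the word ends with "test"
p "unknowns".match?(GROUPED)    # false
p "contest".match?(GROUPED)     # false
p "test".match?(GROUPED)        # true
```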

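The final piece is the "\w+" near the end of the expression. This matches one or more word characters: the letters a-z and A-Z, the digits 0-9, and the underscore. It's the part that actually consumes the word once the look-ahead has confirmed that it isn't one of the excluded ones.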
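To see what counts as a word character, here is a short sketch with a made-up string. Anything that isn't a letter, digit, or underscore ends the current word.

```ruby
# Hyphens, apostrophes, and spaces are not word characters,
# so they split the text into separate matches.
WORDS = "lead_42 first-name O'Brien".scan(/\w+/)
p WORDS
# ["lead_42", "first", "name", "O", "Brien"]
```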

There's an excellent RegEx quick reference guide on the Rubular site that can help you navigate RegEx.

Conclusion

This rather confusing-looking piece of code has a lot of power behind it. During phase 3 at Flatiron School, I learned two things that require a high level of knowledge and precision. The first is RegEx, which we covered a bit of today, and the second is scraping using Nokogiri. More on that second topic another day. I hope you enjoyed this small tutorial about RegEx!