Understanding Regular Expressions once and for all [PART 1]

#regex

Originally published at: Codegram's blog

Did you ever try to learn regular expressions and feel a bit overwhelmed when confronted with a bunch of characters? Characters that, in the correct order, can create any kind of search pattern you are looking for? If so, believe me, you are not alone. And good news: after this series, all the question marks in your head will be gone, and you might even start using regular expressions to ask your friends about their dinner plans. (-> send them a link, so they can answer)

WHAT’S THE FUSS ABOUT

To get an idea of how regular expressions are structured, let’s have a look at a very simple example. Imagine we have the following sentence:

Hummus is a dip made from chickpeas, tahini (optional), lemon, garlic and some more spices.

We want to check whether chickpeas are an ingredient of Hummus, so let’s write our very first regular expression: /chickpea/. That’s it. That’s a fully working and acceptable regular expression. The important part is how you read it: do not read the word as a whole, but character by character. Start at the very beginning:
The first / signals that a regular expression starts here. It is followed by c, h, i, c, k, p, e, a and finally another /, which marks the end of our pattern. Et voilà, we get a match, because the sentence contains exactly this sequence of characters inside the word 'chickpeas'. And with that, you know one very important group of regular expression building blocks: literal characters. Whenever a character in a pattern has no special meaning (unlike, for example, ^ or $), it’s a literal character and simply matches itself. The fun starts with the next group though: the special characters.
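If you prefer to see this in code, here is a minimal sketch in Python (my choice for illustration; the article itself is not tied to any language). Note that Python has no /…/ notation, so the pattern is written as a plain string without the slashes. The variable names are made up for this example.

```python
import re

sentence = (
    "Hummus is a dip made from chickpeas, tahini (optional), "
    "lemon, garlic and some more spices."
)

# The pattern consists of literal characters only: c, h, i, c, k, p, e, a.
match = re.search(r"chickpea", sentence)

found = match is not None
matched_text = match.group(0) if match else None
start = match.start() if match else None

print(found, matched_text, start)
```

re.search scans the sentence from left to right and stops at the first place where all eight literal characters appear in a row.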

WHEN YOU ENTER A RESTAURANT, TAKE YOUR HAT OFF

Let’s imagine our text not only explains what Hummus is, but contains a whole food list. We are only interested in Hummus, because it’s so damn yummy. So we look at the very first character(s) of each line. To do so, we’ll use a special character: the hat (also known as the caret): ^. Instead of writing /Hummus/, we add the tiny hat in front of our literal characters: /^Hummus/. This way, it doesn’t matter if one of the other foods happens to contain Hummus; our pattern will only match the word Hummus at the beginning of a line:

Hummus is a dip made from chickpeas, tahini (optional), lemon, garlic and some more spices. | Origin: Middle East

Falafel is a deep-fried ball (or patty), made from ground chickpeas, herbs, spices and onions and gets served with Hummus. | Origin: Egypt

Miso soup consists of a stock called “dashi” mixed with miso paste. | Origin: Japan

Tofu is made by coagulating soy milk and then pressing the resulting curds into blocks of varying softness. | Origin: China

Note that the Hummus inside the Falafel description doesn’t get matched!
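To try the hat outside of an editor, here is a small Python sketch using the food list from above. One assumption to be aware of: in Python, ^ only matches at the start of every line when the re.MULTILINE flag is set; without it, ^ matches only at the very start of the whole text. Other languages have a similar switch.

```python
import re

food_list = "\n".join([
    "Hummus is a dip made from chickpeas, tahini (optional), lemon, garlic and some more spices. | Origin: Middle East",
    "Falafel is a deep-fried ball (or patty), made from ground chickpeas, herbs, spices and onions and gets served with Hummus. | Origin: Egypt",
    "Miso soup consists of a stock called “dashi” mixed with miso paste. | Origin: Japan",
    "Tofu is made by coagulating soy milk and then pressing the resulting curds into blocks of varying softness. | Origin: China",
])

# Without the hat, "Hummus" is found anywhere in the text,
# including inside the Falafel description.
anywhere = re.findall(r"Hummus", food_list)

# With the hat (and re.MULTILINE), only a "Hummus" that starts a line counts.
at_line_start = re.findall(r"^Hummus", food_list, flags=re.MULTILINE)

print(len(anywhere), len(at_line_start))
```

The first search finds both occurrences of Hummus, the second one only the Hummus that opens the first line.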

BEFORE YOU LEAVE, PAY!

Now that we’ve seen how to catch the beginning of a line, what about the end of one? What if we want to get only the food from Japan? We could just write our literal character expression (/Japan/), but maybe 'Japan' is included in the description part of the list as well. That’s not what we want. Also, the literal character expression would match the 'Japan' inside “Japanese” as well. Again, that’s not what we want. We want to find “Japan” at the end of a line.
Coming back to our restaurant situation: we enjoyed a fancy dinner, and what do we have to do before leaving? Right, pay the bill. And here comes our next special character: $, the dollar sign. While the hat is placed at the beginning of the expression, the dollar sign is attached to the end of it, like so: /Japan$/. With that, you match Japan only in the third row of our example, where it ends the line.
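Here is a short Python sketch of the dollar sign in action. The first two lines come from the food list; the third line is purely hypothetical and only exists to put 'Japan' and 'Japanese' into a description. As with the hat, Python needs the re.MULTILINE flag so that $ matches at the end of every line and not just at the end of the whole text.

```python
import re

lines = [
    "Miso soup consists of a stock called “dashi” mixed with miso paste. | Origin: Japan",
    "Tofu is made by coagulating soy milk and then pressing the resulting curds into blocks of varying softness. | Origin: China",
    # Hypothetical line, invented for this example only:
    "Example dish that is popular in Japan and with Japanese chefs. | Origin: Elsewhere",
]
text = "\n".join(lines)

# The literal pattern also finds "Japan" in the description and in "Japanese".
anywhere = re.findall(r"Japan", text)

# With the dollar sign, only a "Japan" that ends a line counts.
at_line_end = re.findall(r"Japan$", text, flags=re.MULTILINE)

# Checking line by line: for a single-line string, $ matches at its end.
matching_lines = [line for line in lines if re.search(r"Japan$", line)]

print(len(anywhere), len(at_line_end))
print(matching_lines)
```

Only the Miso soup line survives the second and third check, because it is the only one where Japan is the last thing on the line.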

Now that we know how to catch something at the beginning of a line (^) and at the end of a line ($), let’s have a look at how we can match a whole word anywhere in the text. As mentioned, the regular expression /Japan/ finds each “Japan” inside our list. But it also matches inside 'Japanese', because 'Japan' is part of it. If we want to search for the word 'Japan' only, we need the word boundary, which looks like this: \b. This special character differs from the previous ones in that it consists of two characters: \ and b. You’ll see the backslash more often in the next parts. It is a special character on its own, and in this case it indicates that the following 'b' is not a literal character but has a special meaning. Another difference from the hat and the dollar sign is that it appears not once but twice: before and after the word you are trying to find. In our case, if we want to match “Japan” only, our expression could look like this: /\bJapan\b/. I said it’s gonna be fun, right?
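A quick Python sketch of the word boundary, using a made-up sample sentence (it is not part of the food list). One Python-specific detail: the pattern is written as a raw string (r"..."), because in a normal Python string "\b" would be read as a backspace character instead of a word boundary.

```python
import re

# Hypothetical sample sentence, invented for this example only.
text = "Japanese miso soup comes from Japan."

# The literal pattern matches twice: inside "Japanese" and the word "Japan".
plain = re.findall(r"Japan", text)

# With word boundaries on both sides, only the standalone word matches.
whole_word = re.findall(r"\bJapan\b", text)

# Where in the sentence the standalone word starts.
positions = [m.start() for m in re.finditer(r"\bJapan\b", text)]

print(len(plain), len(whole_word), positions)
```

The full stop after the last 'Japan' is not a word character, so \b is happy there, while the 'e' that follows 'Japan' in 'Japanese' prevents a match.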

COOKING TIME

Reading about regular expressions is one thing, writing them is a totally different one. So let’s jump into our kitchen and prepare some nice regular expressions. Although there are different tools out there to test regular expressions, we’ll keep it simple and use our favourite editor for now (mine is Visual Studio Code). Create a new file 'hungry-regex.md' and paste the food list from above into it. Open the search bar (⌘+F, or Ctrl+F on Windows and Linux) and click this magical sign: .*. That sign enables searching with regular expressions (Note: you don’t need to add / at the start and end of your regular expression!). Now, let’s try out some scenarios:

Write a regular expression to find the characters 'is' inside the whole list.
Write a regular expression to find the characters 'Miso' at the beginning of a line.
Write a regular expression to find the characters 'Yemen' at the end of a line.
Write a regular expression to find the word 'is' inside the whole list.
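If you would like to double-check your results outside the editor, the sketch below runs one possible answer per scenario against the food list in Python (spoiler alert: try the scenarios yourself first). The patterns are my suggestions, not the only valid ones. Keep in mind that the list, as given above, contains no line ending in 'Yemen', so that search is expected to come back empty.

```python
import re

food_list = "\n".join([
    "Hummus is a dip made from chickpeas, tahini (optional), lemon, garlic and some more spices. | Origin: Middle East",
    "Falafel is a deep-fried ball (or patty), made from ground chickpeas, herbs, spices and onions and gets served with Hummus. | Origin: Egypt",
    "Miso soup consists of a stock called “dashi” mixed with miso paste. | Origin: Japan",
    "Tofu is made by coagulating soy milk and then pressing the resulting curds into blocks of varying softness. | Origin: China",
])

# 1. The characters 'is' anywhere, also inside words like "Miso" or "consists".
is_anywhere = re.findall(r"is", food_list)

# 2. 'Miso' at the beginning of a line (re.MULTILINE makes ^ work per line).
miso_at_start = re.findall(r"^Miso", food_list, flags=re.MULTILINE)

# 3. 'Yemen' at the end of a line (re.MULTILINE makes $ work per line).
yemen_at_end = re.findall(r"Yemen$", food_list, flags=re.MULTILINE)

# 4. The word 'is' on its own, thanks to the word boundaries.
is_as_word = re.findall(r"\bis\b", food_list)

print(len(is_anywhere), len(miso_at_start), len(yemen_at_end), len(is_as_word))
```

Comparing the first and the last count shows nicely how much noise the word boundary removes.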

Well done! Stay tuned for the next part, where we'll check out some more special characters like \w, \s and \d.

Photo by George Pagan III on Unsplash