Zero | One

Posted on Oct 17, 2019 • Edited on Aug 19, 2020

Regular Expressions(RegEx) in Python

#python #regex #computerscience

Any developer, or aspiring developer, knows how useful regular expressions can be when dealing with strings and text. But regex is generally considered tough, and people don't want to get their hands dirty with it. At least that was the case for me until I found this Quora thread. The most upvoted answer to that Quora question lays out a path to learn, understand and practice regex quickly and systematically. I followed that path, and I am going to share what I learnt so that anyone can find all the information in a single place.

So, What is Regex?

A regular expression is a sequence of characters that describes a pattern to be matched against strings. Text documents like log files and phone books contain many characters and words, and we often want to extract meaningful information from them. This is where regular expressions come into play. A regular expression picks out specific data from the given strings according to your needs, and it's up to you what you do with the extracted characters.
For example, many web applications that require users to sign up place some constraints on the password being created, as shown below:

password should:
contain at least 6 alphanumeric characters,
start with an uppercase letter, 
end with at least one special symbol(*,@,# etc.)

In such case we can create a pattern through regular expressions as shown below:

'^[A-Z]\w{5}.*[*@#]$'

What the symbols above mean is what regex is all about, so there is no need to worry at this point. If you enter a password that violates the rules above, it won't match the regular expression and the password cannot be created, as shown below:

Create a password: 1223as //-->password cannot be created!!
Create a password: A123bc@# //-->password created!!

The regex written above defined some pattern and if any string did not match that pattern it did not allow password creation. Likewise, there can be many uses of regex to filter out text and extract specific information from text and do whatever we desire to do with them.
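
As a quick check, the pattern above can be tried in Python, the language used in the rest of this post. This is only a minimal sketch: the helper name is_valid_password and the constant PASSWORD_PATTERN are made up for illustration and are not part of any library.

```python
import re

# Pattern from the example above: an uppercase letter, five word characters,
# anything in between, and a special symbol at the very end.
PASSWORD_PATTERN = r'^[A-Z]\w{5}.*[*@#]$'

def is_valid_password(candidate):
    """Hypothetical helper: True if candidate matches PASSWORD_PATTERN."""
    return re.search(PASSWORD_PATTERN, candidate) is not None

for candidate in ['1223as', 'A123bc@#']:
    if is_valid_password(candidate):
        print(candidate, '-->password created!!')
    else:
        print(candidate, '-->password cannot be created!!')
```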

Working with Regex(using Python3)

First things first: regex is generally used from within a programming language to operate on text from a text file, log file etc. We are going to use Python to work with regex. You need to be a little familiar with Python syntax and the concepts of modules and OOP to follow along. To use regular expressions in Python we need to import the module re. The re module has many functions which facilitate working with regex. We will quickly look at the most important ones, with which we can perform almost every action related to regex. While explaining them I will use regex in its simplest form, that is, without any metacharacters. Metacharacters are special characters which have a special meaning in regular expressions. This will make us realize that regular expressions are ordinary string objects and not just the complicated pile of symbols that beginners usually think they are. In the simplest form, they can be used as plain strings, as we will see. The main methods that we are going to look at are:

1. re.search()

This, in my view, is the simplest and best way to work with regex in python. In fact, if done wisely, you can do almost every kind of string filtering and searching with this function alone. This method has the following signature:

re.search(pattern,string,flags)
where the pattern is the regex written by you
the string is a string with which we want to match our pattern
flags are optional third arguments used for different purposes briefly discussed below

Let's see how this works.

import re
res=re.search(r'coding','dev is the best place to learn coding.')
print(res) 
res=re.search(r'code','dev is the best place to learn coding.')
print(res)

You can run the above code and see the result yourself, but for reference the output of the above code is:

Output:
<_sre.SRE_Match object; span=(31, 37), match='coding'>
None

re.search() returns what is known as a match object, as shown above, if the regex matches somewhere in the string. If a match object is returned, it indicates that the pattern was found in the string; if no such pattern is present, we get None as the result. If we want to see the content of the match object we can use the group() method. So, replace each print(res) statement in the above code with print(res.group()). Try it in your IDE: the first call prints 'coding', but the second one fails with:

Output:
Runtime Errors:
Traceback (most recent call last):
  File "/home/f79b34f7380e5d299ca6998dec4c6fef.py", line 6, in <module>
    print(res.group())
AttributeError: 'NoneType' object has no attribute 'group'

A runtime error occurs because the second search returned None, which has no group() method, and we are trying to call it. We will see more of search() and group() later.
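
One way to avoid this error is to test the result before calling group(). The sketch below assumes a small wrapper named first_match, which is a made-up helper for illustration and not a function of the re module.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: the matched text, or None when nothing matches."""
    res = re.search(pattern, text)
    # A match object is truthy and None is falsy, so test before calling group().
    return res.group() if res else None

sentence = 'dev is the best place to learn coding.'
print(first_match(r'coding', sentence))  # the pattern is present
print(first_match(r'code', sentence))    # the pattern is absent, no exception raised
```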

2. re.match()

re.match() is similar to search() except that it returns a match object only if the given pattern is present at the beginning of the string. So, it is a special case of search() in which we are looking for a pattern only at the beginning of the string. The following example makes this clear:

# match() method of re module
import re
s1='China is the most populous country in the world'
s2='Most populous country in the world is China'
res=re.match(r'China',s1)
print(res) #prints Match object as China is at the beginning of s1.
res=re.match(r'China',s2)
print(res) #prints None as China is not at the beginning of s2.
# another way to use search() and match(): directly as a condition
if re.search('China',s1) and re.search('China',s2):
    print('search possible in s1 and s2')

The above code also shows another way to use search() and match(): a match object is truthy and None is falsy, so the result can be used directly in an if statement to do something only when the pattern is found.

matching start and end

We saw that the match() method returns a match object only when the pattern occurs at the beginning of the test string. We can achieve the same thing with the metacharacter '^'. If we want our pattern to match only at the beginning of the test string, we put '^' at the beginning of the pattern and use it with the search() method. So instead of match() we can use search(), as shown below:

# replacing match using search and ^
import re
s1='China is the most populous country in the world'
s2='Most populous country in the world is China'
res=re.search(r'^China',s1)
print(res) #prints Match object as China is at the beginning of the s1.
res=re.search(r'^China',s2)
print(res) #prints None as China is not at the beginning of the s2.

Output:
<_sre.SRE_Match object; span=(0, 5), match='China'>
None

Similarly, we have the '$' metacharacter, which is placed at the end of the regex. The pattern then matches only if it is present at the end of the test string. Example:

# Using $ to match only at the end of the test string
import re
s1='China is the most populous country in the world'
s2='Most populous country in the world is China'
res=re.search(r'China$',s1)
print(res) #prints None as China is not at the end of s1.
res=re.search(r'China$',s2)
print(res) #prints Match object as China is at the end of s2.

The results are reversed now, as s2 has 'China' at the end of the sentence.

Output:
None
<_sre.SRE_Match object; span=(38, 43), match='China'>

If we combine both '^' and '$' in a pattern the test string will match if it contains whatever is between '^' and '$' and nothing else. Example:

import re
s1='China is the most populous country in the world'
s2='Most populous country in the world is China'
s3 = 'China'
res=re.search(r'^China$',s1)
print(res) #prints None as s1 does not start and end with 'China'.
res=re.search(r'^China$',s2)
print(res) #prints None as s2 does not start and end with 'China'.
res=re.search(r'^China$',s3)
print(res) #prints Match object as test string only contains "China" and nothing else.

Output:
None
None
<_sre.SRE_Match object; span=(0, 5), match='China'>

3. re.findall()

The search() and match() methods return only a single matched substring, in the form of a match object. But there can be many cases in which more than one substring of the given string matches the regex pattern. For example:

import re
res=re.search(r'.at','the cat was chasing rat while the bat was looking at them')
print(res.group())

Here '.' is a metacharacter which represents any character except a newline. Until now we have used only plain ASCII characters in our regex, but regex also has many metacharacters which allow us to create patterns that can do more powerful things. We will see all the essential metacharacters later. In the above case, only the first match is returned, so the output will be 'cat'. But what if we want all such words? Then we can either use loops with search() or use findall(), as shown below:

import re
res=re.findall(r'.at','the cat was chasing rat while bat was looking at them')
print(res)

Output:
['cat', 'rat', 'bat', ' at']

So, findall() returns a list of all non-overlapping matches of the pattern.
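
When the position of each match is needed as well, the re module also provides finditer(), which yields match objects instead of plain strings. The sketch below is one possible way to use it on the same test string; the variable names are only illustrative.

```python
import re

text = 'the cat was chasing rat while bat was looking at them'

# finditer() yields one match object per non-overlapping match, left to right,
# so both the matched text and its starting index are available.
found = [(m.group(), m.start()) for m in re.finditer(r'.at', text)]

for word, position in found:
    print(repr(word), 'starts at index', position)
```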

4. re.compile()

With re.compile(), we can use the regex pattern again and again in our code. The signature of re.compile is:

re.compile(pattern,flags)
flags are optional arguments which we will see later.

compile() returns an object known as a regex object (or pattern object), which has its own search(), match(), findall() and other methods corresponding to the functions of the re module. The code below shows how compile() works, along with the alternative way to work without compile(). There is no big difference between the two.

# compile method of re
import re
comp=re.compile('pattern')  #comp is a regex object not a match object
res=comp.search('the pattern is present in this sentence.') #res is match object
print(res,res.group())
res=re.search('pattern','the pattern is present in this sentence.') #alternate way without compile, here res is match object
print(res,res.group())

Output:
<_sre.SRE_Match object; span=(4, 11), match='pattern'> pattern
<_sre.SRE_Match object; span=(4, 11), match='pattern'> pattern
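
A compiled regex object is most convenient when the same pattern is applied to many strings. The following is a small sketch of that idea; the sentences and variable names are made up for illustration.

```python
import re

# Compile the pattern once and reuse the resulting regex object.
three_letter_at = re.compile(r'.at')

sentences = [
    'the cat was chasing rat',
    'no such word here',
    'the bat was looking at them',
]

# Each call uses the regex object's own findall() method.
matches_per_sentence = [three_letter_at.findall(s) for s in sentences]
for sentence, matches in zip(sentences, matches_per_sentence):
    print(sentence, '->', matches)
```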

optional flag argument in re methods

There is an optional flags argument in all the re pattern matching methods explained above, like search(), match(), findall() and compile().
Flags can change the way regular expressions function. Let's see one example:

import re
res = re.findall(r'Noun','Noun is the naming word. We use noun to name objects', re.IGNORECASE)
print(res)

Output:
['Noun', 'noun']

Without the 3rd argument it matches only one instance, 'Noun', but when we use re.IGNORECASE (also available as re.I) it ignores case and matches both instances of 'noun'. Likewise, there are other flag values. If you want to learn more about them you can refer to this link.
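
Several flags can be combined with the | operator. The sketch below pairs re.IGNORECASE with re.MULTILINE, another flag of the re module which makes '^' match at the start of every line; the two-line test string is an assumption made up for this illustration.

```python
import re

text = 'Noun is the naming word.\nnoun can name objects.'

# Without flags the search is case-sensitive and ^ only matches at the very start.
no_flags = re.findall(r'^noun', text)

# re.MULTILINE lets ^ match at the start of each line; | adds re.IGNORECASE on top.
both_flags = re.findall(r'^noun', text, re.IGNORECASE | re.MULTILINE)

print(no_flags)
print(both_flags)
```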

Syntax of Regular expressions

Now that we have seen how to work with regex in Python using search(), match() and findall(), let's dive deep into the syntax of regular expressions. Regex contains many metacharacters, each with its own meaning, which make it a very powerful tool for extracting and filtering information from text data. Until now we have used regex in Python only with simple strings, but regex is rarely used that way. The metacharacters are described below one by one.

1. Any Character(.)

The '.' is a metacharacter which matches any character in the test string (the string in which we are looking for patterns) except a newline. As we saw in the findall() example above, we use '.' when we do not know beforehand which character will appear. For example, let the test string be

"the house no. is 74-3B."

and suppose we have to check whether it contains a house no. following the rule that a house no. consists of two characters followed by a '-' followed by two more characters. Then we can write the regex as

"..-.." //matches '74-3B' in the test string mentioned above

As any character can occupy the places around '-', we used '.'.
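
The house no. example can be checked with search(). The second half of this sketch also shows the one thing '.' does not match, a newline; the 'a.b' pattern and its test string are made up for illustration.

```python
import re

test_string = 'the house no. is 74-3B.'

res = re.search(r'..-..', test_string)
house_no = res.group() if res else None
print(house_no)

# '.' does not match a newline, so this search finds nothing.
newline_case = re.search(r'a.b', 'a\nb')
print(newline_case)
```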

2. Character Class([])

When we write [] in a regex it represents a character class, which matches exactly one of the characters inside the square brackets. If we put '^' inside the square brackets before all the other characters, it matches any one character that is not in the square brackets. If we put a hyphen(-) between two characters inside the square brackets, it matches any character in the range between them. The following list of examples will clarify these:

Character Class Example	Description
[aA]	matches 'a' or 'A'
[a-z]	matches any one of the characters 'a', 'b', 'c', 'd', ... up to 'z'.
[a-zA-Z]	matches one character which lies in the range 'a' to 'z' or 'A' to 'Z'.
[-abc]	matches '-' or 'a' or 'b' or 'c', i.e. a '-' placed at the start, where it cannot form a range, has no special meaning.
[^ab]	matches any character except 'a' or 'b'.
[^a-z]	matches any character except those in the range a-z.
[a^b]	matches 'a' or '^' or 'b', i.e. a '^' placed anywhere other than first inside the square brackets is treated like any other character.
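
The rows of the table can be tried out with findall(), which lists every single character a class matches. The test strings below are arbitrary examples chosen for this sketch.

```python
import re

# pattern -> test string (the test strings are made up for illustration)
examples = {
    r'[aA]': 'Banana Apple',
    r'[a-z]': 'Ab1c',
    r'[-abc]': 'a-d',
    r'[^ab]': 'abcd',
    r'[a^b]': 'a^c',
}

# findall() returns each character of the test string that the class matches.
results = {pattern: re.findall(pattern, text) for pattern, text in examples.items()}
for pattern, matched in results.items():
    print(pattern, '->', matched)
```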

3. Predefined Character Classes

With character classes, we can create any custom character class as we desire but regex also comes with some predefined character classes ready to use as mentioned below:

Character Class Example	Description
\d	matches any digit. Equivalent to [0-9] for ASCII text (in Python 3, str patterns also match other Unicode digits unless the re.ASCII flag is used).
\D	matches any non-digit character. It is the complement of \d. Equivalent to [^0-9].
\s	matches any whitespace character. Equivalent to [ \t\n\r\f\v].
\S	matches any character which is not whitespace. Equivalent to [^ \t\n\r\f\v].
\w	matches any alphanumeric character or underscore, also called a word character. Equivalent to [a-zA-Z0-9_] for ASCII text (in Python 3, str patterns also match Unicode letters and digits unless the re.ASCII flag is used).
\W	matches any non-word character. It is the complement of \w. Equivalent to [^a-zA-Z0-9_].
\b	matches the empty string at the beginning or end of a word (a word boundary).
\B	matches the empty string anywhere except at the beginning or end of a word.
\\	matches a backslash('\') character.
\.	matches a period('.') character.
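
A few of these predefined classes in action. The test string is made up for this sketch, and '+' (one or more repetitions) is borrowed from the quantifiers section that follows.

```python
import re

text = 'Room 42b, floor_3: cat concat cats.'

digits = re.findall(r'\d', text)          # every single digit
words = re.findall(r'\w+', text)          # runs of word characters
chunks = re.findall(r'\S+', text)         # runs of non-whitespace characters
whole_cat = re.findall(r'\bcat\b', text)  # 'cat' only as a whole word

print(digits)
print(words)
print(chunks)
print(whole_cat)
```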

4. Quantifiers

Quantifiers are very important metacharacters in regex. As their name suggests, they specify how many times something should repeat. A quantifier is always placed after a character or group of characters and defines how many times that character or group may repeat. We will look at five kinds of quantifiers.

a. ?(for optional preceding character)

It makes the preceding character or group of characters optional. For example:

import re
res=re.findall(r"army?", "the arms and ammunitions should be provided to the army for security")
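print(res)

Output:
['arm', 'army']

The 'y' is optional, so the pattern matches 'arm' (inside 'arms') as well as 'army'.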

b. *(zero or more repetitions)

It is used if we want the preceding character or group of characters to repeat zero or more times. For example:

import re
res = re.findall(r'ai*m', 'am aim aiim aiiims ai')
print(res)

Output:
['am', 'aim', 'aiim', 'aiiim']

Above code shows that regex matches if there is an arbitrary number of i's between a and m.

c. +(one or more repetitions)

It matches only if the preceding character occurs one or more times. For example:

import re
res = re.findall(r'ai+m', 'am aim aiim aiiims ai')
print(res)

Output:
['aim', 'aiim', 'aiiim']

We can see that for the same test string, only those strings match which have at least one 'i' between 'a' and 'm'.

d. {x}(fixed no. of repetitions)

Using + and * is nice, but with them the preceding character can repeat any number of times. In some situations we want a character or group of characters to repeat an exact number of times. This can be achieved using {x}, which matches only if the preceding character or group is repeated exactly x times. For example:

import re
print('Enter your phone no. it must be of 10 digits!!!')
phone = input()
# ^ and $ anchor the pattern so that the whole input must be exactly 10 digits;
# without them, search() would also accept longer inputs that contain 10 digits in a row
res = re.search(r'^\d{10}$', phone)
if res:
    print("correct")
else:
    print("Incorrect no. of digits")

If the user enters fewer or more than 10 digits, he or she will receive the message "Incorrect no. of digits". The output for a valid and an invalid phone no. is shown below:

#valid
Output:
Enter your phone no. it must be of 10 digits!!!
9999999999
correct

#invalid
Output:
Enter your phone no. it must be of 10 digits!!!
100
Incorrect no. of digits
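
One thing to watch when validating with a fixed count: search() looks for the pattern anywhere in the string, so an unanchored \d{10} is also found inside a longer run of digits. The sketch below compares the unanchored pattern with two ways of checking the whole string, the '^' and '$' anchors seen earlier and re.fullmatch(), a function of the re module that succeeds only when the entire string matches.

```python
import re

eleven_digits = '99999999999'

# Unanchored: search() finds 10 digits inside the 11-digit string.
unanchored = re.search(r'\d{10}', eleven_digits)

# Anchored with ^ and $: the whole string must be exactly 10 digits.
anchored = re.search(r'^\d{10}$', eleven_digits)

# fullmatch() checks the whole string without explicit anchors.
full = re.fullmatch(r'\d{10}', eleven_digits)

print(unanchored is not None, anchored is not None, full is not None)
```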

e. {x,y}(fixed and ranged repetitions):

In the above example, 100 is evaluated as invalid by our code. But 100 is, in fact, a valid number. So sometimes we may want to allow a range of lengths. Then we use {x,y}, which matches only if the preceding character or group repeats at least x times and at most y times. Example:

import re
print('Enter your phone no. it should be between 3 to 10 digits!!!')
phone = input()
# anchored with ^ and $ so that inputs longer than 10 digits are rejected too
res = re.search(r'^\d{3,10}$', phone)
if res:
    print("correct")
else:
    print("Incorrect no. of digits")

Output:
Enter your phone no. it should be between 3 to 10 digits!!!
100
correct

If we write {,y} it means the preceding character may repeat from 0 to y times, and if we write {x,} it means it must repeat at least x times, with no upper limit. Note that there must be no spaces inside the braces.
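
The two open-ended forms can be compared side by side. This is a minimal sketch: the check helper is hypothetical, and it uses re.fullmatch() from the re module so that the whole candidate string has to fit the pattern.

```python
import re

def check(pattern, candidates):
    """Hypothetical helper: map each candidate to True/False for a whole-string match."""
    return {c: re.fullmatch(pattern, c) is not None for c in candidates}

# {2,} means at least two digits, with no upper limit.
at_least_two = check(r'\d{2,}', ['1', '12', '12345'])

# {,3} means from zero up to three digits.
at_most_three = check(r'\d{,3}', ['', '123', '1234'])

print(at_least_two)
print(at_most_three)
```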

5. Grouping and Capturing

The next piece of regex syntax involves groups and capturing. Grouping is one of the best features of regex. We can create a group by wrapping part of our regular expression in parentheses (). If we put a quantifier after a group, it applies to the whole group and not to a single character. Let's understand grouping with an example:

import re
for _ in range(3):
    print('Enter your phone no. it must be of 10 digits!!!. ISD code is optional.')
    phone = input()
    # ^ and $ make sure nothing but the optional ISD code and 10 digits is present
    res = re.search(r'^(\+\d{2})?\d{10}$', phone)
    if res:
        print("correct")
    else:
        print("Incorrect no. of digits")

Output:
Enter your phone no. it must be of 10 digits!!!. ISD code is optional.
9999999999
correct
Enter your phone no. it must be of 10 digits!!!. ISD code is optional.
+919999999999
correct
Enter your phone no. it must be of 10 digits!!!. ISD code is optional.
988
Incorrect no. of digits

The above example shows how groups work in regex. To make the ISD code optional we could have applied the '?' operator character by character, but groups make it very convenient because they treat all these characters as one unit. We grouped the characters and made them optional by applying the '?' operator once. However, this is not all that we can do with groups. We can capture groups and use them later in our programs. This makes regex very powerful: we can extract any desired strings and patterns and do some processing with them.

Before looking at how capturing is done, let's first take a closer look at a method that match objects provide.

group()

Until now we have mostly been using match objects in if statements, but what if we want to see which characters of the test string were actually matched? For this purpose, we use the group() method which was discussed briefly before. The group() method is called on the match object returned by search() or match(). Let's see an example:

import re
res=re.search(r'coding','dev is the best place to learn coding.')
print(res)
print(res.group())

Output:
<_sre.SRE_Match object; span=(31, 37), match='coding'>
coding

From the above output, we can see that when the group() method is used in its simplest form, without any arguments, it simply returns the characters that were matched. When we print res itself, as it is a match object, we get <_sre.SRE_Match object; span=(31, 37), match='coding'> as output. But when we print res.group() we get the actual text that was matched.

Alternative matching in groups

We can use the or operator '|' inside a group to match any one of several alternatives, as shown below:

import re
regex = r'(Tom|Dick|Harry)'
res = re.search(regex, 'Tom is in the pattern')
res2 = re.search(regex, 'Dick is in the pattern')
res3 = re.search(regex, 'Harry is in the pattern')
res4 = re.search(regex, 'Sal is in the pattern')
print(res)
print(res2)
print(res3)
print(res4)

Output:
<_sre.SRE_Match object; span=(0, 3), match='Tom'>
<_sre.SRE_Match object; span=(0, 4), match='Dick'>
<_sre.SRE_Match object; span=(0, 5), match='Harry'>
None

From above example we can see that any pattern from the group matches.

Capturing and backreferences

Calling group() on a match object extracts the matched characters, as shown above. But group() becomes more useful with arguments: if we wrap a specific part of the expression in a group (using parentheses), the group captures whatever that part matched, and we can get the content of the group by its number, as shown below:

import re
res=re.search(r'(cod)ing','gfg is the best place to learn coding.')
print(res)
print(res.group())
print(res.group(1))

Output:
<_sre.SRE_Match object; span=(31, 37), match='coding'>
coding
cod

As we can see, after wrapping the characters 'cod' in parentheses we formed the group (cod). So, to get all the matched characters we simply type res.group(), and to get the contents of the group (cod) we type res.group(1). One natural question is: why did we use 1 as the argument and nothing else? It will become clear when we use multiple groups in the same regular expression. This is, in fact, another use of groups: to split an expression into parts and use them conveniently in our program. A simple example is shown below:

import re
res=re.search(r'(cod)(ing)','gfg is the best place to learn coding.')
print(res)
print(res.group())
print(res.group(1))
print(res.group(2))

Output:
<_sre.SRE_Match object; span=(31, 37), match='coding'>
coding
cod
ing

The use of number 1 and 2 as arguments becomes evident now. Numbers indicate the groups starting from left to right(leftmost group in the expression is assigned the value of 1 and then the value increases as we move towards right). And this process is what we call capturing. We can use groups to capture specific character sets and use them in our programs.
To see how capturing is useful, we will look at an example borrowed from HackerRank. The problem is explained below:

We have been given the list of phone numbers of the format

[Country code]-[Local Area Code]-[Number]

Our job as a regex expert is to split it into country code, local area code and number and display them distinctly as shown below:

given number = 91-011-23413627
desired output after processing:
CountryCode=91,LocalAreaCode=011,Number=23413627

Some constraints are:

The number of numbers, N is in the range 1<=N<=20.
There might either be a '-' ( ascii value 45), or a ' ' ( space, ascii value 32) between the segments

We can solve above problem with the code below using groups and capturing:

import re
for i in range(int(input())):
    s=input()
    res=re.search('([0-9]{1,3})[- ]([0-9]{1,3})[- ]([0-9]{4,10})',s)
    if res:
        print("CountryCode={},LocalAreaCode={},Number={}".format(res.group(1),res.group(2),res.group(3)))

The code above uses many concepts that we have learnt until now. Let's see one by one:

The first for..in loop is used because the user is asked how many numbers they want to process.
For every number, we check whether the number string matches our pattern which must contain from 1 to 3 digits in the country code segment, from 1 to 3 digits in local area code segment and 4 to 10 digits in actual number segment.
We use groups to capture each segment.
We display each segment using the group() method with the arguments 1, 2 and 3, from left to right.
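
The same steps can also be written without reading from input. The sketch below uses the match object's groups() method, which returns all captured groups as a tuple, instead of calling group(1), group(2) and group(3) one by one. The function name split_phone and the second sample number are assumptions made for this illustration.

```python
import re

PHONE_PATTERN = r'([0-9]{1,3})[- ]([0-9]{1,3})[- ]([0-9]{4,10})'

def split_phone(number):
    """Hypothetical helper: (country code, local area code, number), or None."""
    res = re.search(PHONE_PATTERN, number)
    # groups() returns every captured group, numbered from left to right.
    return res.groups() if res else None

for number in ['91-011-23413627', '1 877 2638277']:
    parts = split_phone(number)
    if parts:
        print('CountryCode={},LocalAreaCode={},Number={}'.format(*parts))
```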

6. Backreferences

Backreferences are useful when the same text has to appear again later in the same expression. Whenever we create a group, the text it matched is stored under the group's number so that it can be used later in the program or in the expression itself. The numbers we passed to the group() method refer to these stored groups, so we have already seen one use of them: using captured groups in our program. Now we will see how to refer to a captured group inside the expression itself. Let's look at a regular expression for instance:

[A-Za-z]{3}\d{3}[A-Za-z]{3}

This pattern simply requires 3 alphabetical characters followed by 3 numerical digits followed by 3 alphabetical characters again. Now suppose the last three letters must be the same as the first three. Repeating [A-Za-z]{3} cannot express that, because each copy is free to match different letters. Instead we group the first occurrence and refer back to it, as shown below:

([A-Za-z]{3})\d{3}\1

'\1' in the above example is a backreference to the group ([A-Za-z]{3}). It does not repeat the pattern of the group; it matches exactly the text that the group captured. So 'abc123abc' matches but 'abc123xyz' does not. Numbering is done the same way as for the group() method, starting from 1 and increasing from left to right.
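
A small check of how '\1' behaves: a backreference matches the text that the group captured, not the group's pattern. The patterns and test strings below are illustrative choices, with '^' and '$' added so that the whole string is tested.

```python
import re

# Three letters, three digits, then the SAME three letters again.
same_letters = r'^([A-Za-z]{3})\d{3}\1$'

repeated = re.search(same_letters, 'abc123abc')
different = re.search(same_letters, 'abc123xyz')
print(repeated is not None, different is not None)

# Another common use: spotting a word that was typed twice in a row.
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')
print(doubled.group(), '| repeated word:', doubled.group(1))
```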

Search and replace with re.sub() method

The final method that we will see before closing this tutorial is sub() method of re module which helps us to search for specific text in the test string and replace it with the desired text. The signature of sub() method is:

re.sub(pattern, replacement, string)
where the pattern is the regex written by you
replacement is the string which we want to replace our pattern with
the string is a test string

One simple example is shown below:

import re
test = "people in indie are called indiens"
res = re.sub(r'indie', 'India', test)
print(res)

Output:
people in India are called Indians
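
Two further options of sub() are sketched below: the optional count argument, which limits how many replacements are made, and group references such as \1 inside the replacement string. The phone number pattern is the one from the capturing example; the output format '+91 (011) ...' is just an illustrative choice.

```python
import re

test = "people in indie are called indiens"

# count=1 replaces only the first occurrence.
first_only = re.sub(r'indie', 'India', test, count=1)
print(first_only)

# \1, \2 and \3 in the replacement refer to the captured groups of the pattern.
reformatted = re.sub(r'(\d{1,3})[- ](\d{1,3})[- ](\d{4,10})',
                     r'+\1 (\2) \3',
                     '91-011-23413627')
print(reformatted)
```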

Conclusion

That's it. This is the end of this somewhat long tutorial. I don't claim in any way that this tutorial is complete or that I have covered everything that can be done with regex in Python, but the basics are pretty much covered. After reading this tutorial I would recommend doing the following:

Read some more topics in detail like word boundaries(\b and \B), splitting with regular expressions, more flags. But apart from these, this tutorial has covered all the basics.
Practice regex syntax from this site
Practice using regex with Python from the HackerRank regex track

Follow these three steps after this tutorial and you will have decent enough knowledge to apply regex not only in python but any other language. As they say "practice makes permanent".

Cheers!!

Credits(Cover Image):Image by msandersmusic from Pixabay

Top comments (2)

Kat 🐆🐾 • May 6 '20 • Edited

Hey! I love your post, it's very helpful!

I just wanted to point out that there are some "bugs" in your documentation that confused me at first, e.g. in this code block:

# replacing match using search and ^
import re
s1='China is the most populous country in the world'
s2='Most populous country in the world is China'
res=re.search(r'China$',s1)
print(res) #prints Match object as China is at the beginning of s2.
res=re.search(r'China$',s2)
print(res) #prints None as China is not at the beginning of s2.

(I don't mean to complain, just help make your blog post "flawless" cause I find it really helpful!)

Zero | One • May 8 '20

Thank you so much for reading the article and pointing my mistakes to me. It really means a lot to me. As it was written in one sitting, possibilities are there might be more mistakes. If you or anyone finds out, please point out to me. Some of the mistakes have been corrected as pointed out by you.